
I am running a basic script that loops over a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this:

data = {"SomeCity": {"Date1": [record1, record2, record3, ...], "Date2": [], ...}, ...}

In total it has a few million records. The script itself looks like this:

from pandas import DataFrame, Series

cities = ["SomeCity"]
df = DataFrame({}, columns=['Date', 'HouseID', 'Price'])
for city in cities:
    for dateRun in data[city]:
        for record in data[city][dateRun]:
            recSeries = Series([record['Timestamp'],
                                record['Id'],
                                record['Price']],
                               index=['Date', 'HouseID', 'Price'])
            df = df.append(recSeries, ignore_index=True)

However, this runs painfully slowly. Before I look for a way to parallelize it, I just want to make sure I'm not missing something obvious that would make it perform faster as it is, as I'm still quite new to Pandas.

3 Comments

  • Have you looked at from_dict?
  • Appending rows to DataFrames is inherently inefficient. Try to create the entire DataFrame with its final size in one go. As EdChum says, in this case you can probably do this with from_dict.
  • Thanks! I'll give both a try and see how it performs.

7 Answers

I also used the DataFrame's append function inside a loop and was perplexed by how slowly it ran.

A useful example for those who are suffering, based on the correct answer on this page.

Python version: 3

Pandas version: 0.20.3

import pandas as pd

# the dictionary to pass to DataFrame.from_dict
d = {}

# a counter used as the row key when adding entries to d
i = 0

# example data to loop over and append to a dataframe
data = [{"foo": "foo_val_1", "bar": "bar_val_1"},
        {"foo": "foo_val_2", "bar": "bar_val_2"}]

# the loop
for entry in data:

    # add a dictionary entry to the final dictionary
    d[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}

    # increment the counter
    i = i + 1

# create the dataframe using 'from_dict'
# important: set the 'orient' parameter to "index" so the dict keys become rows
df = pd.DataFrame.from_dict(d, orient="index")

The "from_dict" function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html


11 Comments

This example is definitely very helpful!
This is a fast approach for sure, but since Python's default dictionary was not ordered before Python 3.7, data exported to Excel might get mixed up randomly. I highly recommend using OrderedDict from the collections module.
This is really quick. Operation taking around 20 seconds now completes within milliseconds. Thanks a ton :)
Great tip. Incredibly useful. For my use case, I went from 45+ minutes down to less than 5 by using this method.
I dropped from nearly 2h to less than 5 seconds xD THANKS!

Appending rows to a list is far more efficient than appending to a DataFrame. Hence you would want to:

  1. append the rows to a list,
  2. then convert the list into a DataFrame, and
  3. set the index as required (see the sketch below).
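
A minimal sketch of this pattern, using stand-in records and column names (the input data here is just an assumption for illustration):

import pandas as pd

# stand-in input: any iterable of record dicts
records = [{'Timestamp': '2015-01-13', 'Id': 1, 'Price': 100.0},
           {'Timestamp': '2015-01-14', 'Id': 2, 'Price': 110.0}]

# 1. append plain dicts to a list (cheap, O(1) per row)
rows = []
for record in records:
    rows.append({'Date': record['Timestamp'],
                 'HouseID': record['Id'],
                 'Price': record['Price']})

# 2. build the DataFrame once from the list
df = pd.DataFrame(rows)

# 3. set the index as required
df = df.set_index('Date')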

2 Comments

Great and simple solution! For everyone searching for the implementation of step 2: simply do df = pd.DataFrame(my_list, columns=['col1', 'col2']).
It's a really good, simple solution!

Another way is to collect the rows into a list and then use pd.concat

import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

def append(df):
    df_out = df.copy()
    for i in range(1000):
        df_out = df_out.append(df)
    return df_out

def concat(df):
    df_list = []
    for i in range(1001):
        df_list.append(df)

    return pd.concat(df_list)


# some testing
df2 = concat(df)
df3 = append(df)

pd.testing.assert_frame_equal(df2,df3)

%timeit concat(df)

20.2 ms ± 794 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit append(df)

275 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is now the recommended way to concatenate rows in pandas:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once. link

Comments


I think the best way to do it, if you know the data you are going to receive, is to allocate beforehand.

import numpy as np
import pandas as pd

random_matrix = np.random.randn(100, 100)
insert_df = pd.DataFrame(random_matrix)

df = pd.DataFrame(columns=range(100), index=range(200))
df.loc[range(100), df.columns] = random_matrix
df.loc[range(100, 200), df.columns] = random_matrix

This is the pattern that I think makes the most sense. append will be faster if you have a very small dataframe, but it doesn't scale.

In [1]: import numpy as np; import pandas as pd

In [2]: random_matrix = np.random.randn(100, 100)
   ...: insert_df = pd.DataFrame(random_matrix)
   ...: df = pd.DataFrame(np.random.randn(100, 100))

In [3]: %timeit df.append(insert_df)
272 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit df.loc[range(100), df.columns] = random_matrix
493 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df.loc[range(100), df.columns] = insert_df
821 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

When we run this with a 100,000 row dataframe, we see much more dramatic results.

In [1]: df = pd.DataFrame(np.random.randn(100_000, 100))

In [2]: %timeit df.append(insert_df)
17.9 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit df.loc[range(100), df.columns] = random_matrix
465 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit df.loc[range(99_900, 100_000), df.columns] = random_matrix
465 µs ± 5.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df.loc[range(99_900, 100_000), df.columns] = insert_df
1.02 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So we can see that an append is about 17 times slower than an insert with a DataFrame, and about 38 times slower than an insert with a NumPy array.

Comments


I ran into a similar problem where I had to append to a DataFrame many times but did not know the values in advance of the appends. I wrote a lightweight, DataFrame-like data structure that is just blists under the hood. I use it to accumulate all of the data and then, when it is complete, transform the output into a Pandas DataFrame. Here is a link to my project, all open source, so I hope it helps others:

https://pypi.python.org/pypi/raccoon
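
The general accumulate-then-transform idea looks something like this (a plain-lists sketch of the pattern, not raccoon's actual API):

import pandas as pd

# cheap per-row appends into plain Python lists, one list per column
col_buffers = {'Date': [], 'HouseID': [], 'Price': []}

def add_row(date, house_id, price):
    # appending to a list is O(1); no DataFrame is touched here
    col_buffers['Date'].append(date)
    col_buffers['HouseID'].append(house_id)
    col_buffers['Price'].append(price)

add_row('2015-01-13', 1, 100.0)
add_row('2015-01-14', 2, 110.0)

# single transformation into a DataFrame once the data is complete
df = pd.DataFrame(col_buffers)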

1 Comment

nice library - adding it to my core MVPs

In my case I was loading a large number of dataframes with the same columns from different files and wanted to append them to create one large DataFrame.

My solution was to first load all the dataframes into a list, and then use

all_dfs = []
for i in all_files:
    df = ...  # load the dataframe from the file here
    all_dfs.append(df)

master_df = pd.concat(all_dfs, ignore_index=True)
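
As a concrete version, assuming the files are CSVs (the glob pattern here is hypothetical):

import glob
import pandas as pd

all_dfs = []
for path in glob.glob("data/*.csv"):    # hypothetical location of the input files
    all_dfs.append(pd.read_csv(path))   # every file shares the same columns

master_df = pd.concat(all_dfs, ignore_index=True)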

1 Comment

This worked really well for me.

A quick benchmark of three approaches: appending rows to a list and using from_records, building a dict of rows and using from_dict, and appending to the DataFrame one row at a time:

import time
import pandas as pd

N = 100000

# approach 1: append rows to a list, then build with from_records
t0 = time.time()
d = []
for i in range(N):
    d.append([i, i+1, i+2, i+3, i+0.1, 1+0.2])
testdf = pd.DataFrame.from_records(d, columns=["x1", "x2", "x3", "x4", "x5", "x6"])
print(time.time() - t0)

# approach 2: build a dict of row dicts, then build with from_dict
t0 = time.time()
d = {}
for i in range(N):
    d[len(d)+1] = {"x1": i, "x2": i+1, "x3": i+2, "x4": i+3, "x5": i+0.1, "x6": 1+0.2}
testdf = pd.DataFrame.from_dict(d, orient="index")
print(time.time() - t0)

# approach 3: append to the DataFrame one row at a time (slow)
t0 = time.time()
testdf = pd.DataFrame()
for i in range(N):
    testdf = testdf.append({"x1": i, "x2": i+1, "x3": i+2, "x4": i+3, "x5": i+0.1, "x6": 1+0.2}, ignore_index=True)
print(time.time() - t0)


=== result for N=10000 ===
list:0.016329050064086914
dict:0.03952217102050781
DataFrame:10.598219871520996

=== result for N=100000 ===
list: 0.4076499938964844
dict: 0.45696187019348145
DataFrame: 187.6609809398651
