Create several dataframes faster. For loop is too slow

Question

I'm trying to create several dataframes using the code below. I have a list of names (lista_names) and one dataframe (df1), and I would like to create one dataframe for each name in my list. In each new dataframe, one of the columns would hold the Levenshtein distance between that name and every name in df1. In the end I would have n new dataframes, where n is the number of names in my list. Here is my code:

lev = pd.DataFrame({'Levenshtein':0,'n_ordem':0,'nome_ea':'a','nome_censo':'a'}, index = [1])

for i in range(0,len(lista_names)):
    for k in range(0,len(df1)):
        if isinstance(df1['nome_comp'][k],str):
            if Levenshtein.distance(lista_names[i], df1['nome_comp'][k])<=21:
                lev = lev.append({'Levenshtein':Levenshtein.distance(lista_names[i], df1['nome_comp'][k]),
                'n_ordem': df1['n_ordem'][k], 'nome_ea': lista_names[i],'nome_censo': df1['nome_comp'][k]}, 
                                 ignore_index = True)

lev.drop(0, axis=0, inplace = True)

lev.to_csv('levenshtein.csv')

Although this solution works, it is too slow: it has not finished building the CSV file even after running for 2 days on my PC. Is there a way to make it faster?

Edit1: n=291

Ami Tavory · Accepted Answer · 2019-10-09 17:31:37Z

The problem is with the line

lev = lev.append({'Levenshtein':Levenshtein.distance(lista_names[i], df1['nome_comp'][k])

within the loop.

Pandas DataFrames are not designed for sequential insertion, and are very inefficient at it: each call to append returns a new DataFrame, copying all the rows accumulated so far.

Instead, create an empty list levs before the loop, and append a one-row DataFrame to it within the loop:

levs.append(pd.DataFrame({'Levenshtein': Levenshtein.distance(lista_names[i], df1['nome_comp'][k]),
            'n_ordem': df1['n_ordem'][k], 'nome_ea': lista_names[i], 'nome_censo': df1['nome_comp'][k]},
            index=[0]))

When the loop is done, call pd.concat(levs, ignore_index=True) to combine the pieces into a single DataFrame. YMMV, but from similar cases I've had, it should be 10-200 times faster than your current code.
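
For reference, here is a minimal sketch of how the whole loop might look with this approach. It is a variation rather than the exact code above: it collects plain dicts in a list and builds the DataFrame once at the end, which avoids creating many one-row DataFrames. The helper name build_lev, the max_dist parameter, the sample data, and the pure-Python fallback for the distance function are all illustrative assumptions, not part of the original post.

```python
import pandas as pd

try:
    # The library used in the question; used here when it is installed.
    from Levenshtein import distance as lev_distance
except ImportError:
    # Fallback (an assumption for this sketch): textbook dynamic programming.
    def lev_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]


def build_lev(lista_names, df1, max_dist=21):
    """Collect matching rows in a list, then build the DataFrame once."""
    # Read the two columns once and skip non-string names,
    # mirroring the isinstance check in the question.
    pairs = [(nome, ordem)
             for nome, ordem in zip(df1['nome_comp'], df1['n_ordem'])
             if isinstance(nome, str)]
    rows = []
    for nome_ea in lista_names:
        for nome_censo, n_ordem in pairs:
            d = lev_distance(nome_ea, nome_censo)  # computed once per pair
            if d <= max_dist:
                rows.append({'Levenshtein': d, 'n_ordem': n_ordem,
                             'nome_ea': nome_ea, 'nome_censo': nome_censo})
    return pd.DataFrame(
        rows, columns=['Levenshtein', 'n_ordem', 'nome_ea', 'nome_censo'])


# Tiny made-up data, only to show the call.
df1 = pd.DataFrame({'nome_comp': ['maria silva', None, 'joao souza'],
                    'n_ordem': [1, 2, 3]})
lev = build_lev(['maria silvo', 'joao'], df1, max_dist=3)
```

With the real data, the result could then be written out as in the question with lev.to_csv('levenshtein.csv'). The sketch also computes each distance once instead of twice per pair, which the original loop does not.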
