I have a pandas DataFrame (pandas version 1.5.3) and I want to select records by index and iterate over them in a loop. I was using df_info = df.loc[[idx]], which returns a DataFrame with the selected rows. However, this runs MANY times, and I noticed that this line specifically takes a lot of time. Using cProfile, I saw that most of the time is spent in 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine'. How can I do this more efficiently?
An example of what the code looks like:
import pandas as pd
import cProfile
from tqdm import tqdm
def iterate_through_df():
    indexes = df.index.unique()
    for idx in tqdm(indexes):
        df_info = df.loc[[idx]]
        # The code continues...
df = pd.read_csv('random_data.csv', index_col='id')
cProfile.run('iterate_through_df()', sort='cumulative')
I generated the CSV for testing with this code:
import pandas as pd
import numpy as np
size = 100000
num_columns = 100
data = {}
for i in range(1, num_columns + 1):
    key = f'name{i}:'
    data[key] = np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eva'], size=size)
random_indexes = np.random.randint(1, 100, size=size)
df = pd.DataFrame(data, index=random_indexes)
df.to_csv('random_data.csv', index_label='id')
Most of the time is spent on this call (which I want to optimize):
ncalls tottime percall cumtime percall filename:lineno(function)
99 1.046 0.011 1.046 0.011 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}
I tried df_info = df.loc[idx]. The execution time was indeed shorter, but the problem is that the return type varies: it is a pandas Series when there is just one record for that index, and a DataFrame when there is more than one, and I need it to always be a DataFrame.
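One pattern that avoids the repeated non-unique index lookups entirely is to iterate with groupby(level=0): it walks the index in a single pass and yields each index value together with its sub-frame, which is always a DataFrame even when the index value occurs only once. A minimal sketch, using a small illustrative frame instead of the CSV above:

```python
import pandas as pd

# Small frame with a non-unique integer index, mimicking the CSV data
df = pd.DataFrame(
    {'name1': ['Alice', 'Bob', 'Charlie', 'David']},
    index=pd.Index([1, 1, 2, 3], name='id'),
)

# groupby(level=0) groups rows by index value in one pass;
# each df_info is a DataFrame, never a Series
for idx, df_info in df.groupby(level=0):
    assert isinstance(df_info, pd.DataFrame)
    # The code continues...
```

This replaces many separate .loc lookups with one grouping operation, so the per-iteration cost no longer involves get_indexer_non_unique at all.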
Put the indexes in a tmp_df, sort, and join with df. Also take a look at the 'duckdb' Python library to do this with trivial SQL, multithreaded.
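The sorting idea can also be applied without any extra library: if the index is sorted once up front, it becomes monotonic, and a label slice like df.loc[idx:idx] resolves via binary search instead of a full scan, while always returning a DataFrame. A minimal sketch, assuming data shaped like the question's CSV:

```python
import pandas as pd
import numpy as np

# Illustrative frame with a non-unique integer index, like the CSV data
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {'name1': rng.choice(['Alice', 'Bob', 'Charlie'], size=1000)},
    index=pd.Index(rng.integers(1, 100, size=1000), name='id'),
)

# Sort once; on a monotonic index, label slices use binary search
df_sorted = df.sort_index()
for idx in df_sorted.index.unique():
    df_info = df_sorted.loc[idx:idx]  # always a DataFrame
    # The code continues...
```

The one-time sort_index cost is quickly amortized over the many per-index lookups in the loop.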