
I have a dataframe in pandas (version 1.5.3) and I want to select records by an index and go through them in a loop. I was using df_info = df.loc[[idx]], which returns a DataFrame with the selected rows. However, this runs MANY times and I noticed that this line specifically is taking a lot of time. Using cProfile, I saw that most of the time goes to 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine'. How can I do this more efficiently?

An example of what the code looks like:

import pandas as pd
import cProfile
from tqdm import tqdm


def iterate_through_df():
    indexes = df.index.unique()
    for idx in tqdm(indexes):            
        df_info = df.loc[[idx]]
        #The code continues...

df = pd.read_csv('random_data.csv', index_col='id')
cProfile.run('iterate_through_df()', sort='cumulative')

I made the csv for testing with this code:

import pandas as pd
import numpy as np

size = 100000
num_columns = 100

data = {}
for i in range(1, num_columns + 1):
    key = f'name{i}:'
    data[key] = np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eva'], size=size)
random_indexes = np.random.randint(1, 100, size=size)
df = pd.DataFrame(data, index=random_indexes)
df.to_csv('random_data.csv', index_label='id')

Most of the time is spent in this call (which is what I want to optimize):

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
99    1.046    0.011    1.046    0.011 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}

I tried df_info = df.loc[idx]. The execution time was indeed shorter, but the problem is that the return is sometimes a pandas Series (when there is just one record for that index) and sometimes a DataFrame (when there is more than one record), and I need it to always be a DataFrame.
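One way to get a DataFrame in every case, while also avoiding the repeated non-unique index lookups, is to iterate over groups instead of looking up each index value separately. A minimal sketch, using a small toy frame in place of the CSV-backed one (groupby(level=0) groups by the index and yields each group as a DataFrame):

```python
import pandas as pd

# Small stand-in for the CSV-backed frame with a non-unique 'id' index.
df = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Charlie', 'David']},
    index=[1, 1, 2, 3],
)
df.index.name = 'id'

# groupby(level=0) walks the index once; each group is always a
# DataFrame, even when it holds a single row.
for idx, df_info in df.groupby(level=0):
    assert isinstance(df_info, pd.DataFrame)
    # ...the per-index processing goes here
```

This replaces all the per-iteration .loc lookups with a single pass over the index.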

  • You should really avoid doing that and vectorize the code instead. See How to iterate over rows in a Pandas DataFrame?. If you cannot, you can convert the array to a list so accesses are faster, but this will still be clearly sub-optimal (note that lists take significantly more memory and the conversion can be slow). Commented Feb 1, 2024 at 13:51
  • This looks like an XY problem; please explain what you're trying to do with a minimal reproducible example. We might be able to help with a better strategy. Commented Feb 1, 2024 at 14:13
  • @mozway thanks for the comments. I have rewritten the question. Commented Feb 5, 2024 at 20:28
  • If I am not mistaken, your index column in the CSV is in random order. Sorting df by that value should speed up both indexing and lookup. Which, by the way, is not needed: just pack the tqdm(indexes) in a tmp_df, sort, and join with df. Take a look at the duckdb Python library to do it in trivial SQL, multithreaded. Commented Feb 5, 2024 at 21:15
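Following the sorting suggestion in the last comment, a minimal sketch (toy data in place of the original CSV): with a sorted (monotonic) index, .loc lookups can use binary search on contiguous slices instead of the slower non-unique hash path, while still returning a DataFrame.

```python
import numpy as np
import pandas as pd

# Toy frame with a non-unique, unsorted index, mimicking the question's data.
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.integers(0, 5, 1000)},
                  index=rng.integers(1, 100, 1000))

# Sort once up front; the monotonic index makes every .loc lookup cheaper.
df = df.sort_index()

for idx in df.index.unique():
    df_info = df.loc[[idx]]  # still always a DataFrame
    # ...the per-index processing goes here
```

The one-time sort_index() cost is quickly amortized over the many per-index lookups in the loop.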
