I have a pandas DataFrame (pandas version 1.5.3) and I want to select records by index and iterate over them in a loop. I was using df_info = df.loc[[idx]], which returns a DataFrame with the selected rows. However, this runs MANY times, and I noticed that this line specifically takes a lot of time. Using cProfile, I saw that most of the time is spent in 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine'. How can I do this more efficiently?
An example of what the code looks like:
import pandas as pd
import cProfile
from tqdm import tqdm
def iterate_through_df():
    indexes = df.index.unique()
    for idx in tqdm(indexes):
        df_info = df.loc[[idx]]
        # The code continues...
df = pd.read_csv('random_data.csv', index_col='id')
cProfile.run('iterate_through_df()', sort='cumulative')
I generated the CSV for testing with this code:
import pandas as pd
import numpy as np
size = 100000
num_columns = 100
data = {}
for i in range(1, num_columns + 1):
    key = f'name{i}:'
    data[key] = np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eva'], size=size)
random_indexes = np.random.randint(1, 100, size=size)
df = pd.DataFrame(data, index=random_indexes)
df.to_csv('random_data.csv', index_label='id')
Most of the time is spent on this call (which I want to optimize):
ncalls tottime percall cumtime percall filename:lineno(function)
99 1.046 0.011 1.046 0.011 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}
I tried df_info = df.loc[idx]. The execution time was indeed shorter, but the problem is that the return type varies: it is a pandas Series when there is just one record for that index, and a DataFrame when there is more than one, and I need it to always be a DataFrame.
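One pattern that avoids the repeated non-unique index lookups entirely is to iterate with groupby(level=0): it walks the index in a single pass and yields each index value together with its sub-frame, which is always a DataFrame even when the index value occurs only once. A minimal sketch, using a small illustrative frame instead of the CSV above:

```python
import pandas as pd

# Small frame with a non-unique integer index, mimicking the CSV data
df = pd.DataFrame(
    {'name1': ['Alice', 'Bob', 'Charlie', 'David']},
    index=pd.Index([1, 1, 2, 3], name='id'),
)

# groupby(level=0) groups rows by index value in one pass;
# each df_info is a DataFrame, never a Series
for idx, df_info in df.groupby(level=0):
    assert isinstance(df_info, pd.DataFrame)
    # The code continues...
```

This replaces many separate .loc lookups with one grouping operation, so the per-iteration cost no longer involves get_indexer_non_unique at all.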
Put the indexes in a tmp_df, sort, and join with df. Also take a look at the 'duckdb' Python library to do this with trivial SQL, multithreaded.
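The sorting idea can also be applied without any extra library: if the index is sorted once up front, it becomes monotonic, and a label slice like df.loc[idx:idx] resolves via binary search instead of a full scan, while always returning a DataFrame. A minimal sketch, assuming data shaped like the question's CSV:

```python
import pandas as pd
import numpy as np

# Illustrative frame with a non-unique integer index, like the CSV data
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {'name1': rng.choice(['Alice', 'Bob', 'Charlie'], size=1000)},
    index=pd.Index(rng.integers(1, 100, size=1000), name='id'),
)

# Sort once; on a monotonic index, label slices use binary search
df_sorted = df.sort_index()
for idx in df_sorted.index.unique():
    df_info = df_sorted.loc[idx:idx]  # always a DataFrame
    # The code continues...
```

The one-time sort_index cost is quickly amortized over the many per-index lookups in the loop.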