Looping through Pandas dataframe to generate list - most efficient way

Question

I have a DataFrame in the following format. The real one has ~200,000 rows, of which ~20% are active ('Y'); the rest are 'N':

active  adtype              body   eng         first scan  id
N       Private Seller Car  Coupe  8cyl 4.7L   31/01/2016  SSE-AD-3469148
Y       Dealer: Near New    Coupe  12cyl 6.5L  31/01/2016  OAG-AD-12326299
N       Dealer: Used Car    Coupe  12cyl 6.5L  31/01/2016  OAG-AD-6834787

I'm creating a list of the ids that are then cross-checked against some website scrape data to find new items:

database_ids = database_records['id'].tolist() #simple list of ad IDs from CSV
database_ids = set(database_ids)
database_dicts = database_records.to_dict(orient='records') #Converted to list of dicts  
newads = []
adscrape_ids = []

# Search database for existing ads. Append new ads to 'newads'
for ad in adscrape:
    ad['last scan'] = date
    ad['active'] = 'Y'
    adscrape_ids.append(ad['id'])
    if ad['id'] not in database_ids:
        ad['first scan'] = date
        print('new ad:', ad)
        newads.append(ad)

I want to speed this process up by limiting database_ids to only ids that are still active ('Y'). Is there any pandas-specific, efficient way of doing this, or should I just create a loop:

database_ids = []  # start from an empty list; a set has no .append()
for row in database_dicts:
    if row['active'] == 'Y':
        database_ids.append(row['id'])
database_ids = set(database_ids)

Ami Tavory · Accepted Answer · 2016-02-07 13:59:05Z

You can do this far more efficiently (I'm willing to bet you'll be able to see a noticeable difference in speed):

set(database_records[database_records.active == 'Y']['id'].unique())

database_records[database_records.active == 'Y'] filters the DataFrame (database_records, not the list of dicts) and keeps only the rows you want.
.unique() returns the unique values (in this case, of the id column).
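
To see the filter in isolation, here is a minimal sketch on a toy frame built from the three sample rows in the question. It assumes database_records is the DataFrame loaded from the CSV; the names is_active and active_ids are just illustrative.

```python
import pandas as pd

# Toy stand-in for database_records, using the sample rows from the question.
database_records = pd.DataFrame({
    'active': ['N', 'Y', 'N'],
    'adtype': ['Private Seller Car', 'Dealer: Near New', 'Dealer: Used Car'],
    'id': ['SSE-AD-3469148', 'OAG-AD-12326299', 'OAG-AD-6834787'],
})

# Boolean mask: True for rows whose ad is still active.
is_active = database_records['active'] == 'Y'

# Keep only the active rows, take their ids, and build a set for fast lookups.
active_ids = set(database_records.loc[is_active, 'id'].unique())

print(active_ids)
```

The .loc[mask, 'id'] form selects rows and column in one step; it should give the same result as the chained [mask]['id'] form shown above.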

In general, you should try to do as much as possible while the data is in the DataFrame - it is much more efficient than loops and pure Python.
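
As one possible way to follow that advice for the new-ad check itself, the scraped ads could also go into a DataFrame and be compared with isin instead of a Python loop. This is only a sketch: the contents of adscrape, the id 'NEW-AD-1', and the date value are made up for illustration, and it assumes every scraped ad is a dict with an 'id' key.

```python
import pandas as pd

# Toy stand-in for database_records (only the columns needed here).
database_records = pd.DataFrame({
    'active': ['N', 'Y', 'N'],
    'id': ['SSE-AD-3469148', 'OAG-AD-12326299', 'OAG-AD-6834787'],
})

# Hypothetical scrape results; 'NEW-AD-1' is an invented id.
adscrape = [{'id': 'OAG-AD-12326299'}, {'id': 'NEW-AD-1'}]
date = '07/02/2016'  # placeholder scan date

# Load the scrape into a DataFrame and stamp every row at once.
scraped = pd.DataFrame(adscrape)
scraped['last scan'] = date
scraped['active'] = 'Y'

# Ids of ads that are still active in the database.
active_ids = database_records.loc[database_records['active'] == 'Y', 'id']

# An ad is new if its id is not among the active ids.
is_new = ~scraped['id'].isin(active_ids)
scraped.loc[is_new, 'first scan'] = date

# Back to a list of dicts, matching the shape of 'newads' in the question.
newads = scraped[is_new].to_dict(orient='records')
print(newads)
```

Whether this beats the set-membership loop for a given scrape size is worth timing rather than assuming; the set built from the filtered DataFrame already makes each lookup cheap.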

edited Feb 7, 2016 at 13:59

answered Feb 7, 2016 at 13:50

Ami Tavory
