1

I have a dataframe in the following format (actually ~200,000 rows. Of these ~20% are active - 'Y', the rest are 'N'):

active  adtype           body      eng          first scan   id
N       Private Seller Car   Coupe  8cyl 4.7L     31/01/2016  SSE-AD-3469148
Y       Dealer: Near New     Coupe  12cyl 6.5L    31/01/2016  OAG-AD-12326299
N       Dealer: Used Car     Coupe  12cyl 6.5L    31/01/2016  OAG-AD-6834787

I'm creating a list of the ids that are then cross-checked against some website scrape data to find new items:

database_ids = database_records['id'].tolist() #simple list of ad IDs from CSV
database_ids = set(database_ids)
database_dicts = database_records.to_dict(orient='records') #Converted to list of dicts  
newads = []
adscrape_ids = []

#Search database for existing ads. Append new ads to 'newads'
 for ad in adscrape:
     ad['last scan'] = date
     ad['active'] = 'Y'
     adscrape_ids.append(ad['id'])
     if ad['id'] not in database_ids:
         ad['first scan'] = date
         print 'new ad:',ad
         newads.append(ad)

I want to speed this process up by limiting database_ids to only ids that are still active ('Y'). Is there any pandas-specific, efficient way of doing this, or should I just create a loop:

for row in database_dicts:
    if row['active'] == 'Y':
        database_ids.append(row['id'])
database_ids = set(database_ids)

1 Answer 1

1

You can do this far more efficiently (I'm willing to bet you'll be able to see a noticeable difference in speed):

set(database_dicts[database_dicts.active == 'Y']['id'].unique())
  • database_dicts[database_dicts.active == 'Y'] filters and retains the rows you want.

  • .unique() will return the unique values (in this case, of the id column).

In general, you should try to do as much as possible while the data is in the DataFrame - it is much more efficient than loops and pure Python.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.