I have a dataframe in the following format (actually ~200,000 rows. Of these ~20% are active - 'Y', the rest are 'N'):
active adtype body eng first scan id
N Private Seller Car  Coupe  8cyl 4.7L 31/01/2016 SSE-AD-3469148
Y Dealer: Near New  Coupe  12cyl 6.5L 31/01/2016 OAG-AD-12326299
N Dealer: Used Car  Coupe  12cyl 6.5L 31/01/2016 OAG-AD-6834787
I'm creating a list of the ids that are then cross-checked against some website scrape data to find new items:
database_ids = database_records['id'].tolist() #simple list of ad IDs from CSV
database_ids = set(database_ids)
database_dicts = database_records.to_dict(orient='records') #Converted to list of dicts
newads = []
adscrape_ids = []
#Search database for existing ads. Append new ads to 'newads'
for ad in adscrape:
ad['last scan'] = date
ad['active'] = 'Y'
adscrape_ids.append(ad['id'])
if ad['id'] not in database_ids:
ad['first scan'] = date
print 'new ad:',ad
newads.append(ad)
I want to speed this process up by limiting database_ids to only ids that are still active ('Y'). Is there any pandas-specific, efficient way of doing this, or should I just create a loop:
for row in database_dicts:
if row['active'] == 'Y':
database_ids.append(row['id'])
database_ids = set(database_ids)