I have written the following code to preprocess a dataset like this:
StartLocation  StartTime                           EndTime
school         Mon Jul 25 19:04:30 GMT+01:00 2016  Mon Jul 25 19:04:33 GMT+01:00 2016
...            ...                                 ...
It contains a list of locations visited by a user, with the start and end time of each visit. Each location may occur several times, and there is no comprehensive list of locations. From this, I want to aggregate data for each location (frequency, total time, mean time). To do this, I have written the following code:
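import re
from datetime import datetime

import pandas as pd

def toEpoch(x):
    # Convert a timestamp string to minutes since the epoch.
    try:
        # Drop the last colon so "+01:00" becomes "+0100" for %z.
        x = datetime.strptime(re.sub(r":(?=[^:]+$)", "", x), '%a %b %d %H:%M:%S %Z%z %Y').strftime('%s')
    except ValueError:
        # Fall back to timestamps without a numeric offset.
        x = datetime.strptime(x, '%a %b %d %H:%M:%S %Z %Y').strftime('%s')
    x = (int(x)/60)
    return x

#Preprocess data
df = pd.read_csv('...')
for index, row in df.iterrows():
    # .loc avoids chained assignment and creates 'TimeTaken' on first write.
    df.loc[index, 'StartTime'] = toEpoch(df.loc[index, 'StartTime'])
    df.loc[index, 'EndTime'] = toEpoch(df.loc[index, 'EndTime'])
    df.loc[index, 'TimeTaken'] = int(df.loc[index, 'EndTime']) - int(df.loc[index, 'StartTime'])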
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)
This code works correctly, but it is quite inefficient. How can I optimise it?
EDIT: Based on @Batman's helpful comments I no longer iterate. However, I still hope to further optimise this if possible. The updated code is:
df = pd.read_csv('...')
df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)
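
For what it's worth, the three groupby calls above could probably be collapsed into one. This is only a minimal sketch on made-up toy data (the DataFrame below is an assumption standing in for the real CSV, with TimeTaken already computed in minutes); it groups once on the lower-cased location and uses named aggregation to get the total, mean and count together:

```python
import pandas as pd

# Toy data standing in for the real CSV (assumption); TimeTaken is in minutes.
df = pd.DataFrame({
    "StartLocation": ["School", "school", "Home"],
    "TimeTaken": [10.0, 20.0, 5.0],
})

# Group once on the lower-cased location and compute all three
# statistics for TimeTaken in a single aggregation.
output = (
    df.groupby(df["StartLocation"].str.lower())["TimeTaken"]
      .agg(total="sum", mean="mean", count="count")
      .reset_index()
      .rename(columns={"StartLocation": "location"})
)
print(output)
```

Selecting only the TimeTaken column before aggregating also avoids summing or averaging columns that are not needed in the result.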

sum, mean and count. .str.lower()? Do you really need regex? apply is still iterating.
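
On the regex and apply points, one possible direction is to let pandas parse the whole column at once rather than calling toEpoch per row. The sketch below is untested against the real file and assumes every timestamp has the "GMT+HH:MM" form shown in the sample row; rows in the fallback format (a zone name with no numeric offset) would need separate handling. The toy DataFrame and the FMT name are made up for illustration:

```python
import pandas as pd

# Toy rows in the same layout as the sample data (assumption: every row
# carries a "GMT+HH:MM" offset).
df = pd.DataFrame({
    "StartLocation": ["school", "School"],
    "StartTime": ["Mon Jul 25 19:04:30 GMT+01:00 2016",
                  "Mon Jul 25 20:00:00 GMT+01:00 2016"],
    "EndTime": ["Mon Jul 25 19:04:33 GMT+01:00 2016",
                "Mon Jul 25 20:30:00 GMT+01:00 2016"],
})

# "GMT" is matched literally; %z reads the "+01:00" offset, so no regex
# is needed to strip the colon first.
FMT = "%a %b %d %H:%M:%S GMT%z %Y"

# Parse each column in one vectorised call and normalise to UTC.
start = pd.to_datetime(df["StartTime"], format=FMT, utc=True)
end = pd.to_datetime(df["EndTime"], format=FMT, utc=True)

# Duration of each visit in minutes.
df["TimeTaken"] = (end - start).dt.total_seconds() / 60
print(df[["StartLocation", "TimeTaken"]])
```

This also sidesteps strftime('%s'), which is not a documented Python format code and depends on the platform's C library.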