
I have written the following code to preprocess a dataset like this:

StartLocation   StartTime   EndTime
school          Mon Jul 25 19:04:30 GMT+01:00 2016  Mon Jul 25 19:04:33 GMT+01:00 2016
...             ...         ...

It contains a list of locations attended by a user with the start and end time. Each location may occur several times and there is no comprehensive list of locations. From this, I want to aggregate data for each location (frequency, total time, mean time). To do this I have written the following code:

import re
from datetime import datetime

def toEpoch(x):
    try:
        # drop the colon in the UTC offset ("GMT+01:00" -> "GMT+0100") so %z can parse it
        x = datetime.strptime(re.sub(r":(?=[^:]+$)", "", x), '%a %b %d %H:%M:%S %Z%z %Y').strftime('%s')
    except ValueError:
        x = datetime.strptime(x, '%a %b %d %H:%M:%S %Z %Y').strftime('%s')
    x = int(x) / 60  # epoch time in minutes
    return x
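As an aside, `strftime('%s')` is a platform-specific extension (it is not available on Windows); since the parsed datetime is timezone-aware, the same conversion can be done portably with `datetime.timestamp()`. The helper name `to_epoch_minutes` below is a hypothetical sketch of that alternative, not the code the question uses:

```python
import re
from datetime import datetime

def to_epoch_minutes(ts):
    # drop the colon in the UTC offset: "GMT+01:00" -> "GMT+0100"
    ts = re.sub(r":(?=[^:]+$)", "", ts)
    # %Z matches "GMT", %z matches the numeric offset
    dt = datetime.strptime(ts, '%a %b %d %H:%M:%S %Z%z %Y')
    # timestamp() on an aware datetime is portable, unlike strftime('%s')
    return int(dt.timestamp()) // 60
```

Because the datetime carries its own offset, `timestamp()` gives the correct epoch value regardless of the machine's local timezone.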

# Preprocess data
import pandas as pd

df = pd.read_csv('...')
for index, row in df.iterrows():
    df['StartTime'][index] = toEpoch(df['StartTime'][index])
    df['EndTime'][index] = toEpoch(df['EndTime'][index])
    df['TimeTaken'][index] = int(df['EndTime'][index]) - int(df['StartTime'][index])
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)

This code functions correctly; however, it is quite inefficient. How can I optimise it?

EDIT: Based on @Batman's helpful comments I no longer iterate. However, I still hope to further optimise this if possible. The updated code is:

df = pd.read_csv('...')
df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)
  • you should group only once and then get sum, mean and count Commented Jan 23, 2017 at 1:03
  • do you really need to .str.lower() ? do you really need regex ? Commented Jan 23, 2017 at 1:07
  • @furas The locations are manually entered so it's necessary and the regex is to deal with the unusual time stamp used. (See this) Commented Jan 23, 2017 at 1:12
  • using apply is still iterating. Commented Jan 23, 2017 at 1:24

2 Answers


First thing I'd do is stop iterating over the rows.

df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']

Then, do a single groupby operation.

gb = df.groupby('StartLocation')
total = gb.sum()
av = gb.mean()
count = gb.count()
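As the comments note, all three aggregates can also come from a single pass with `agg`. A minimal sketch on hypothetical sample data (assuming `TimeTaken` has already been computed):

```python
import pandas as pd

# hypothetical sample data with TimeTaken already computed in minutes
df = pd.DataFrame({
    'StartLocation': ['school', 'Home', 'home', 'school'],
    'TimeTaken': [3, 10, 20, 5],
})

# one groupby, all three aggregations in a single pass
stats = df.groupby(df['StartLocation'].str.lower())['TimeTaken'].agg(['sum', 'mean', 'count'])
```

This avoids scanning the frame three times and returns one DataFrame with `sum`, `mean` and `count` columns, indexed by the lower-cased location.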

6 Comments

Am I also able to calculate the time taken without iterating?
@user7347576 yes df['TimeTaken'] = df['EndTime'] - df['StartTime'] (if you have numbers in EndTime and StartTime)
@Batman Can I also convert all text to lowercase prior to grouping, efficiently?
Sure. Use df['StartLocation'].apply(str.lower).
@Batman Am not sure why, but the lower doesn't always seem to work. In my final output it produces 'living room', 'Living room' and 'Living Room.' Any ideas why?
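For what it's worth, groups that survive lowercasing (like 'Living Room.' above) are often caused by stray whitespace or punctuation rather than case. A hypothetical normalisation sketch, chaining the vectorised `str` methods:

```python
import pandas as pd

# hypothetical location strings differing only in case, spacing and punctuation
s = pd.Series(['living room', 'Living room ', 'Living Room.'])

# lowercase, trim surrounding whitespace, drop trailing periods
norm = s.str.lower().str.strip().str.rstrip('.')
```

After this, all three variants collapse to the single key `'living room'`.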
  • vectorize the date conversion
  • taking the difference of two series of timestamps gives a series of timedeltas
  • use total_seconds to get the seconds from the timedeltas
  • groupby with agg

# convert dates
cols = ['StartTime', 'EndTime']
df[cols] = pd.to_datetime(df[cols].stack()).unstack()

# generate timedelta then total_seconds via the `dt` accessor
df['TimeTaken'] = (df.EndTime - df.StartTime).dt.total_seconds()

# define the lower case version for cleanliness
loc_lower = df.StartLocation.str.lower()

# define `agg` functions for cleanliness
# this tells `groupby` to use 3 functions, sum, mean, and count
# it also tells what column names to use
funcs = dict(Total='sum', Mean='mean', Count='count')
df.groupby(loc_lower).TimeTaken.agg(funcs).reset_index()



explanation of date conversion

  • I define cols for convenience
  • df[cols] = is an assignment to those two columns
  • pd.to_datetime() is a vectorized date converter, but it takes a pd.Series rather than a pd.DataFrame
  • df[cols].stack() turns the 2-column dataframe into a single series, ready for pd.to_datetime()
  • use pd.to_datetime(df[cols].stack()) as described, then unstack() to get the two columns back, ready to be assigned
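The steps above can be sketched end to end on a small hypothetical frame (ISO-format dates here, so `pd.to_datetime` needs no format hints):

```python
import pandas as pd

# hypothetical frame with two string date columns
df = pd.DataFrame({
    'StartTime': ['2016-07-25 19:04:30', '2016-07-25 20:00:00'],
    'EndTime':   ['2016-07-25 19:04:33', '2016-07-25 20:30:00'],
})

cols = ['StartTime', 'EndTime']
# stack -> one Series, vectorised parse, unstack -> back to two columns
df[cols] = pd.to_datetime(df[cols].stack()).unstack()

# subtracting two datetime columns yields timedeltas; total_seconds via the dt accessor
df['TimeTaken'] = (df['EndTime'] - df['StartTime']).dt.total_seconds()
```

One vectorised parse over a single stacked series replaces a per-cell `apply`, which is where the speedup comes from.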

7 Comments

Could you please explain what this does?
@user7347576 explained :-)
@piRSquared I don't mean to waste your time, but I still don't understand why this would be faster, or where I would use it?
@user7347576 no worries. I left out details because I'm off to donate stuff at goodwill. I assumed you'd make a leap and see what to do. That's my fault. I'll show you what to do in an hour or so
@user7347576 there you go. Let me know if you have any other questions.
