
I have written the following code to preprocess a dataset like this:

StartLocation   StartTime   EndTime
school          Mon Jul 25 19:04:30 GMT+01:00 2016  Mon Jul 25 19:04:33 GMT+01:00 2016
...             ...         ...

It contains a list of locations attended by a user with the start and end time. Each location may occur several times and there is no comprehensive list of locations. From this, I want to aggregate data for each location (frequency, total time, mean time). To do this I have written the following code:

import re
from datetime import datetime

def toEpoch(x):
    try:
        # drop the colon in the UTC offset ("GMT+01:00" -> "GMT+0100") so %z can parse it
        x = datetime.strptime(re.sub(r":(?=[^:]+$)", "", x), '%a %b %d %H:%M:%S %Z%z %Y').strftime('%s')
    except ValueError:
        x = datetime.strptime(x, '%a %b %d %H:%M:%S %Z %Y').strftime('%s')
    x = int(x) / 60  # epoch time in minutes
    return x
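As an aside, `strftime('%s')` is a platform-specific extension (it is not available on Windows); since the parsed datetime is timezone-aware, the same conversion can be done portably with `datetime.timestamp()`. The helper name `to_epoch_minutes` below is a hypothetical sketch of that alternative, not the code the question uses:

```python
import re
from datetime import datetime

def to_epoch_minutes(ts):
    # drop the colon in the UTC offset: "GMT+01:00" -> "GMT+0100"
    ts = re.sub(r":(?=[^:]+$)", "", ts)
    # %Z matches "GMT", %z matches the numeric offset
    dt = datetime.strptime(ts, '%a %b %d %H:%M:%S %Z%z %Y')
    # timestamp() on an aware datetime is portable, unlike strftime('%s')
    return int(dt.timestamp()) // 60
```

Because the datetime carries its own offset, `timestamp()` gives the correct epoch value regardless of the machine's local timezone.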

# Preprocess data
import pandas as pd

df = pd.read_csv('...')
for index, row in df.iterrows():
    df['StartTime'][index] = toEpoch(df['StartTime'][index])
    df['EndTime'][index] = toEpoch(df['EndTime'][index])
    df['TimeTaken'][index] = int(df['EndTime'][index]) - int(df['StartTime'][index])
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)

This code functions correctly; however, it is quite inefficient. How can I optimise it?

EDIT: Based on @Batman's helpful comments I no longer iterate. However, I still hope to further optimise this if possible. The updated code is:

df = pd.read_csv('...')
df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)
  • you should group only once and then get sum, mean and count Commented Jan 23, 2017 at 1:03
  • do you really need to .str.lower() ? do you really need regex ? Commented Jan 23, 2017 at 1:07
  • @furas The locations are manually entered so it's necessary and the regex is to deal with the unusual time stamp used. (See this) Commented Jan 23, 2017 at 1:12
  • using apply is still iterating. Commented Jan 23, 2017 at 1:24

2 Answers


First thing I'd do is stop iterating over the rows.

df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']

Then, do a single groupby operation.

gb = df.groupby('StartLocation')
total = gb.sum()
av = gb.mean()
count = gb.count()
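As the comments note, all three aggregates can also come from a single pass with `agg`. A minimal sketch on hypothetical sample data (assuming `TimeTaken` has already been computed):

```python
import pandas as pd

# hypothetical sample data with TimeTaken already computed in minutes
df = pd.DataFrame({
    'StartLocation': ['school', 'Home', 'home', 'school'],
    'TimeTaken': [3, 10, 20, 5],
})

# one groupby, all three aggregations in a single pass
stats = df.groupby(df['StartLocation'].str.lower())['TimeTaken'].agg(['sum', 'mean', 'count'])
```

This avoids scanning the frame three times and returns one DataFrame with `sum`, `mean` and `count` columns, indexed by the lower-cased location.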

6 Comments

Am I also able to calculate the time taken without iterating?
@user7347576 yes df['TimeTaken'] = df['EndTime'] - df['StartTime'] (if you have numbers in EndTime and StartTime)
@Batman Can I also convert all text to lowercase prior to grouping, efficiently?
Sure. Use df['StartLocation'].apply(str.lower).
@Batman Am not sure why, but the lower doesn't always seem to work. In my final output it produces 'living room', 'Living room' and 'Living Room.' Any ideas why?
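For what it's worth, groups that survive lowercasing (like 'Living Room.' above) are often caused by stray whitespace or punctuation rather than case. A hypothetical normalisation sketch, chaining the vectorised `str` methods:

```python
import pandas as pd

# hypothetical location strings differing only in case, spacing and punctuation
s = pd.Series(['living room', 'Living room ', 'Living Room.'])

# lowercase, trim surrounding whitespace, drop trailing periods
norm = s.str.lower().str.strip().str.rstrip('.')
```

After this, all three variants collapse to the single key `'living room'`.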
  • vectorize the date conversion
  • taking the difference of two series of timestamps gives a series of timedeltas
  • use total_seconds to get the seconds from the timedeltas
  • groupby with agg

# convert dates
cols = ['StartTime', 'EndTime']
df[cols] = pd.to_datetime(df[cols].stack()).unstack()

# generate timedelta then total_seconds via the `dt` accessor
df['TimeTaken'] = (df.EndTime - df.StartTime).dt.total_seconds()

# define the lower case version for cleanliness
loc_lower = df.StartLocation.str.lower()

# define `agg` functions for cleanliness
# this tells `groupby` to use 3 functions, sum, mean, and count
# it also tells what column names to use
funcs = dict(Total='sum', Mean='mean', Count='count')
df.groupby(loc_lower).TimeTaken.agg(funcs).reset_index()



explanation of date conversion

  • I define cols for convenience
  • df[cols] = is an assignment to those two columns
  • pd.to_datetime() is a vectorized date converter, but it takes a pd.Series rather than a pd.DataFrame
  • df[cols].stack() turns the 2-column dataframe into a single series, ready for pd.to_datetime()
  • use pd.to_datetime(df[cols].stack()) as described, then unstack() to get the two columns back, ready to be assigned
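The steps above can be sketched end to end on a small hypothetical frame (ISO-format dates here, so `pd.to_datetime` needs no format hints):

```python
import pandas as pd

# hypothetical frame with two string date columns
df = pd.DataFrame({
    'StartTime': ['2016-07-25 19:04:30', '2016-07-25 20:00:00'],
    'EndTime':   ['2016-07-25 19:04:33', '2016-07-25 20:30:00'],
})

cols = ['StartTime', 'EndTime']
# stack -> one Series, vectorised parse, unstack -> back to two columns
df[cols] = pd.to_datetime(df[cols].stack()).unstack()

# subtracting two datetime columns yields timedeltas; total_seconds via the dt accessor
df['TimeTaken'] = (df['EndTime'] - df['StartTime']).dt.total_seconds()
```

One vectorised parse over a single stacked series replaces a per-cell `apply`, which is where the speedup comes from.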

7 Comments

Could you please explain what this does?
@user7347576 explained :-)
@piRSquared I don't mean to waste your time, but I still don't understand why this would be faster, or where I would use it?
@user7347576 no worries. I left out details because I'm off to donate stuff at goodwill. I assumed you'd make a leap and see what to do. That's my fault. I'll show you what to do in an hour or so
@user7347576 there you go. Let me know if you have any other questions.
