
My data frame contains 10,000,000 rows! After the groupby, there are ~9,000,000 sub-frames to loop through.

The code is:

data = pd.read_csv('big.csv')
for id, new_df in data.groupby(level=0):  # look at mini df and do some analysis
    # some code for each of the small data frames

This is super inefficient, and the code has been running for 10+ hours now.

Is there a way to speed it up?

Full Code:

import pandas as pd

d = pd.DataFrame()  # new df to populate
print 'Start of the loop'
for id, new_df in data.groupby(level=0):
    # every trailing slice of this id's visits, starting at each row in turn
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    # stack the slices, keyed by the visit each slice starts at
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2, 3), drop=True).reset_index()
    # re-index as (id, starting visit, position within the slice)
    x = x.set_index(['level_0', 'level_1', x.groupby(['level_0', 'level_1']).cumcount()])
    d = pd.concat([d, x])
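Part of the slowdown is presumably that d = pd.concat([d, x]) re-copies the accumulated frame on every iteration. Below is a minimal sketch of collecting the pieces in a list and concatenating once at the end, with the per-group logic left unchanged (this alone does not remove the per-group overhead):

pieces = []
for id, new_df in data.groupby(level=0):
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2, 3), drop=True).reset_index()
    x = x.set_index(['level_0', 'level_1', x.groupby(['level_0', 'level_1']).cumcount()])
    pieces.append(x)   # accumulate the per-id results
d = pd.concat(pieces)  # single concatenation at the end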

To get the data:

data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date'])

Note:

Most ids will only have 1 date, which indicates a single visit. For ids with more visits, I would like to structure them in a 3D format, e.g. store all of their visits in the 2nd dimension out of 3. The output shape is (id, visits, features).
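For concreteness, here is a minimal sketch of the input shape (the ids, dates and feature values below are made up purely for illustration, not taken from the real data):

import pandas as pd

# toy input in the same layout as the real data: id 1 has three visits, id 2 has one
toy = pd.DataFrame({
    'id':       [1, 1, 1, 2],
    'date':     [20180311, 20180310, 20180210, 20180312],
    'feature0': [10.0, 20.0, 45.0, 14.0],
}).set_index(['id', 'date'])

Running the full loop above on toy expands id 1's three visits into 3 + 2 + 1 = 6 rows, indexed by (id, starting visit, position within that visit's window), while id 2 contributes a single row.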

  • The answer to this is specific to # some code for each of the small data frames. Do you have an example calculation you are performing, and perhaps some sample data so we can test / benchmark? Commented Mar 16, 2018 at 10:08
  • @jpp I will make an edit Commented Mar 16, 2018 at 10:09
  • Hard question, but if possible dask should help. Commented Mar 16, 2018 at 10:09
  • @jpp I will also create some sample data now Commented Mar 16, 2018 at 10:12
  • just out of interest, what kind of data are you dealing with? Commented Mar 16, 2018 at 10:25

3 Answers


Here is one way to speed this up: add the desired new rows in code that processes the rows directly, which saves the overhead of constantly constructing small dataframes. Your sample of 100,000 rows runs in a couple of seconds on my machine, while your code on only 10,000 rows of the sample data takes > 100 seconds, so this seems to be a couple of orders of magnitude faster.

Code:

import pandas as pd

def make_3d(csv_filename):

    def make_3d_lines(a_df):
        a_df['depth'] = 0
        depth = 0
        prev = None
        accum = []
        for row in a_df.values.tolist():
            row[0] = 0
            key = row[1]
            if key == prev:
                depth += 1
                accum.append(row)
            else:
                if depth == 0:
                    yield row
                else:
                    depth = 0
                    to_emit = []
                    for i in range(len(accum)):
                        date = accum[i][2]
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j
                            to_emit[-1][2] = date
                    for r in to_emit[1:]:
                        yield r
                accum = [row]
            prev = key

    df_data = pd.read_csv(csv_filename)
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split())),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())

Test Code:

import time

start_time = time.time()
df = make_3d('big-data.csv')
print(time.time() - start_time)

df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
print(df[df['depth'] != 0].head(10))

Results:

1.7390995025634766

                          depth  feature0  feature1  feature2
id              date                                         
207555809644681 20180104      1   0.03125  0.038623  0.008130
247833985674646 20180106      1   0.03125  0.004378  0.004065
252945024181083 20180107      1   0.03125  0.062836  0.065041
                20180107      2   0.00000  0.001870  0.008130
                20180109      1   0.00000  0.001870  0.008130
329567241731951 20180117      1   0.00000  0.041952  0.004065
                20180117      2   0.03125  0.003101  0.004065
                20180117      3   0.00000  0.030780  0.004065
                20180118      1   0.03125  0.003101  0.004065
                20180118      2   0.00000  0.030780  0.004065

4 Comments

This seems to be the fastest (and correct) solution. For some reason the output doesn't match the shape produced by my code, but it seems to be doing exactly what I need. (I think I made errors myself, which I need to debug.)
I found a couple of bugs in the code: 1) When an id has only 1 observation and comes up in the loop after an id with depth > 0, that row is never yielded. In effect, every id after the very first block is missing its first row. I can change to_emit[1:] to start at 0, but this doesn't fix the very first case.
To illustrate the problem, please run t = pd.DataFrame(data={'id':[1,1,1,1,2,2,3,4,4,5], 'date':[20180311,20180310,20180210,20170505,20180312,20180311,20180312,20180311,20180312, 20180315], 'feature1':[10,20,45,1,14,15,20,40,1,2],'result':[1,1,0,0,0,0,1,0,0,0]}) t = t.reindex(columns=['id','date','feature1','result']) make_3d(t). Note: in the make_3d() function, I added df_data = csv_filename.reset_index(). In this example, the observation with id=3 is skipped (and 5, but 5 is the last one).
I've posted my modifications in the answer below; in particular we should yield accum[0] instead of row, which seems to fix everything. I'm also not sure about the use of r[0] = 0.

I believe your approach for feature engineering could be done better, but I will stick to answering your question.

In Python, iterating over a dictionary is much faster than iterating over a DataFrame.

Here is how I managed to process a huge pandas DataFrame (~100,000,000 rows):

# reset the Dataframe index to get level 0 back as a column in your dataset
df = data.reset_index()  # id and date become ordinary columns again

# split the DataFrame based on id
# and store the splits as Dataframes in a dictionary using id as key
d = dict(tuple(df.groupby('id')))

# iterate over the Dictionary and process the values
for key, value in d.items():

    pass  # each value is a Dataframe


# concat the values and get the original (processed) Dataframe back  
df2 = pd.concat(d.values(), ignore_index=True)
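As a rough illustration of the difference, here is a timing sketch with synthetic data and a trivial stand-in for the per-group work (the absolute numbers are only indicative, and the one-off cost of building the dict is deliberately excluded from the second timing):

import time

import numpy as np
import pandas as pd

# 100,000 rows spread over 50,000 ids (sizes chosen only for illustration)
df = pd.DataFrame({
    'id': np.repeat(np.arange(50000), 2),
    'feature0': np.random.rand(100000),
})

start = time.time()
for _, g in df.groupby('id'):
    g['feature0'].sum()            # stand-in for the real per-group processing
print('iterating groupby directly:', time.time() - start)

d = dict(tuple(df.groupby('id')))  # split once, up front
start = time.time()
for _, g in d.items():
    g['feature0'].sum()            # same stand-in processing
print('iterating the dict of splits:', time.time() - start)

The dict is built once up front; after that, each value is an ordinary DataFrame that can be processed and finally concatenated as in the snippet above.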



A modified version of @Stephen's code:

def make_3d(dataset):

    def make_3d_lines(a_df):
        a_df['depth'] = 0 # sets all depth from (1 to n) to 0
        depth = 1 # initiate from 1, so that the first loop is correct
        prev = None
        accum = [] # accumulates blocks of data belonging to given user
        for row in a_df.values.tolist(): # for each row in our dataset
            row[0] = 0 # NOT SURE
            key = row[1] # this is the id of the row
            if key == prev: # if this rows id matches previous row's id, append together 
                depth += 1 
                accum.append(row)
            else: # else if this id is new, previous block is completed -> process it
                if depth == 0: # previous id appeared only once -> get that row from accum
                    yield accum[0] # also remember that depth = 0
                else: # process the block and emit each row
                    depth = 0
                    to_emit = [] # prepare to emit the list
                    for i in range(len(accum)): # for each unique day in the accumulated list
                        date = accum[i][2] # define date to be the first date it sees
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j # define the depth
                            to_emit[-1][2] = date # define the date
                    for r in to_emit[0:]:
                        yield r
                accum = [row]
            prev = key

    df_data = dataset.reset_index()
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split(), ascending=[True,False])),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())

Testing:

t = pd.DataFrame(data={'id':[1,1,1,1,2,2,3,3,4,5], 'date':[20180311,20180310,20180210,20170505,20180312,20180311,20180312,20180311,20170501,20180304], 'feature':[10,20,45,1,14,15,20,20,13,11],'result':[1,1,0,0,0,0,1,0,1,1]})
t = t.reindex(columns=['id','date','feature','result'])
print t 
              id     date      feature      result
0              1  20180311          10           1
1              1  20180310          20           1
2              1  20180210          45           0
3              1  20170505           1           0
4              2  20180312          14           0
5              2  20180311          15           0
6              3  20180312          20           1
7              3  20180311          20           0
8              4  20170501          13           1
9              5  20180304          11           1

Output

                        depth     feature      result
id            date                                   
1             20180311      0          10           1
              20180311      1          20           1
              20180311      2          45           0
              20180311      3           1           0
              20180310      0          20           1
              20180310      1          45           0
              20180310      2           1           0
              20180210      0          45           0
              20180210      1           1           0
              20170505      0           1           0
2             20180312      0          14           0
              20180312      1          15           0
              20180311      0          15           0
3             20180312      0          20           1
              20180312      1          20           0
              20180311      0          20           0
4             20170501      0          13           1

