
My data frame contains 10,000,000 rows! After the groupby, there are ~9,000,000 sub-frames to loop through.

The code is:

data = pd.read_csv('big.csv')
for id, new_df in data.groupby(level=0):  # look at mini df and do some analysis
    # some code for each of the small data frames

This is super inefficient, and the code has been running for 10+ hours now.

Is there a way to speed it up?

Full Code:

import pandas as pd

d = pd.DataFrame()  # new df to populate
print 'Start of the loop'
for id, new_df in data.groupby(level=0):
    # every trailing slice of this id's visits, starting at each row in turn
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    # stack the slices, keyed by the visit each slice starts at
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2, 3), drop=True).reset_index()
    # re-index as (id, starting visit, position within the slice)
    x = x.set_index(['level_0', 'level_1', x.groupby(['level_0', 'level_1']).cumcount()])
    d = pd.concat([d, x])
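Part of the slowdown is presumably that d = pd.concat([d, x]) re-copies the accumulated frame on every iteration. Below is a minimal sketch of collecting the pieces in a list and concatenating once at the end, with the per-group logic left unchanged (this alone does not remove the per-group overhead):

pieces = []
for id, new_df in data.groupby(level=0):
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2, 3), drop=True).reset_index()
    x = x.set_index(['level_0', 'level_1', x.groupby(['level_0', 'level_1']).cumcount()])
    pieces.append(x)   # accumulate the per-id results
d = pd.concat(pieces)  # single concatenation at the end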

To get the data:

data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date'])

Note:

Most ids will only have 1 date, which indicates a single visit. For ids with more visits, I would like to structure them in a 3D format, e.g. store all of their visits in the 2nd dimension out of 3. The output shape is (id, visits, features).
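For concreteness, here is a minimal sketch of the input shape (the ids, dates and feature values below are made up purely for illustration, not taken from the real data):

import pandas as pd

# toy input in the same layout as the real data: id 1 has three visits, id 2 has one
toy = pd.DataFrame({
    'id':       [1, 1, 1, 2],
    'date':     [20180311, 20180310, 20180210, 20180312],
    'feature0': [10.0, 20.0, 45.0, 14.0],
}).set_index(['id', 'date'])

Running the full loop above on toy expands id 1's three visits into 3 + 2 + 1 = 6 rows, indexed by (id, starting visit, position within that visit's window), while id 2 contributes a single row.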

  • The answer to this is specific to # some code for each of the small data frames. Do you have an example calculation you are performing, and perhaps some sample data so we can test / benchmark? Commented Mar 16, 2018 at 10:08
  • @jpp I will make an edit Commented Mar 16, 2018 at 10:09
  • Hard question, but if possible dask should help. Commented Mar 16, 2018 at 10:09
  • @jpp I will also create some sample data now Commented Mar 16, 2018 at 10:12
  • just out of interest, what kind of data are you dealing with? Commented Mar 16, 2018 at 10:25

3 Answers


Here is one way to speed this up: add the desired new rows in code that processes the rows directly, which saves the overhead of constantly constructing small dataframes. Your sample of 100,000 rows runs in a couple of seconds on my machine, while your code on only 10,000 rows of the sample data takes > 100 seconds, so this seems to be a couple of orders of magnitude faster.

Code:

import pandas as pd

def make_3d(csv_filename):

    def make_3d_lines(a_df):
        a_df['depth'] = 0
        depth = 0
        prev = None
        accum = []
        for row in a_df.values.tolist():
            row[0] = 0
            key = row[1]
            if key == prev:
                depth += 1
                accum.append(row)
            else:
                if depth == 0:
                    yield row
                else:
                    depth = 0
                    to_emit = []
                    for i in range(len(accum)):
                        date = accum[i][2]
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j
                            to_emit[-1][2] = date
                    for r in to_emit[1:]:
                        yield r
                accum = [row]
            prev = key

    df_data = pd.read_csv(csv_filename)
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split())),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())

Test Code:

import time

start_time = time.time()
df = make_3d('big-data.csv')
print(time.time() - start_time)

df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
print(df[df['depth'] != 0].head(10))

Results:

1.7390995025634766

                          depth  feature0  feature1  feature2
id              date                                         
207555809644681 20180104      1   0.03125  0.038623  0.008130
247833985674646 20180106      1   0.03125  0.004378  0.004065
252945024181083 20180107      1   0.03125  0.062836  0.065041
                20180107      2   0.00000  0.001870  0.008130
                20180109      1   0.00000  0.001870  0.008130
329567241731951 20180117      1   0.00000  0.041952  0.004065
                20180117      2   0.03125  0.003101  0.004065
                20180117      3   0.00000  0.030780  0.004065
                20180118      1   0.03125  0.003101  0.004065
                20180118      2   0.00000  0.030780  0.004065

4 Comments

This seems to be the fastest (and correct) solution. For some reason the output doesn't match the shape produced by my code, but it seems to be doing exactly what I need. (I think I made errors myself, which I need to debug.)
I found a couple of bugs in the code: 1) When an id has only 1 observation and comes up in the loop after an id with depth > 0, that row is never yielded. In effect, every id after the very first block is missing its first row. I can change to_emit[1:] to start at 0, but this doesn't fix the very first case.
To illustrate the problem, please run t = pd.DataFrame(data={'id':[1,1,1,1,2,2,3,4,4,5], 'date':[20180311,20180310,20180210,20170505,20180312,20180311,20180312,20180311,20180312, 20180315], 'feature1':[10,20,45,1,14,15,20,40,1,2],'result':[1,1,0,0,0,0,1,0,0,0]}) t = t.reindex(columns=['id','date','feature1','result']) make_3d(t). Note: in the make_3d() function, I added df_data = csv_filename.reset_index(). In this example, the observation with id=3 is skipped (and 5, but 5 is the last one).
I've posted my modifications in the answer below; in particular we should yield accum[0] instead of row, which seems to fix everything. I'm also not sure about the use of r[0] = 0.

I believe your approach for feature engineering could be done better, but I will stick to answering your question.

In Python, iterating over a dictionary is much faster than iterating over a DataFrame.

Here is how I managed to process a huge pandas DataFrame (~100,000,000 rows):

# reset the Dataframe index to get level 0 back as a column in your dataset
df = data.reset_index()  # id and date become ordinary columns again

# split the DataFrame based on id
# and store the splits as Dataframes in a dictionary using id as key
d = dict(tuple(df.groupby('id')))

# iterate over the Dictionary and process the values
for key, value in d.items():

    pass  # each value is a Dataframe


# concat the values and get the original (processed) Dataframe back  
df2 = pd.concat(d.values(), ignore_index=True)
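As a rough illustration of the difference, here is a timing sketch with synthetic data and a trivial stand-in for the per-group work (the absolute numbers are only indicative, and the one-off cost of building the dict is deliberately excluded from the second timing):

import time

import numpy as np
import pandas as pd

# 100,000 rows spread over 50,000 ids (sizes chosen only for illustration)
df = pd.DataFrame({
    'id': np.repeat(np.arange(50000), 2),
    'feature0': np.random.rand(100000),
})

start = time.time()
for _, g in df.groupby('id'):
    g['feature0'].sum()            # stand-in for the real per-group processing
print('iterating groupby directly:', time.time() - start)

d = dict(tuple(df.groupby('id')))  # split once, up front
start = time.time()
for _, g in d.items():
    g['feature0'].sum()            # same stand-in processing
print('iterating the dict of splits:', time.time() - start)

The dict is built once up front; after that, each value is an ordinary DataFrame that can be processed and finally concatenated as in the snippet above.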



A modified version of @Stephen's code:

def make_3d(dataset):

    def make_3d_lines(a_df):
        a_df['depth'] = 0 # sets all depth from (1 to n) to 0
        depth = 1 # initiate from 1, so that the first loop is correct
        prev = None
        accum = [] # accumulates blocks of data belonging to given user
        for row in a_df.values.tolist(): # for each row in our dataset
            row[0] = 0 # NOT SURE
            key = row[1] # this is the id of the row
            if key == prev: # if this rows id matches previous row's id, append together 
                depth += 1 
                accum.append(row)
            else: # else if this id is new, previous block is completed -> process it
                if depth == 0: # previous id appeared only once -> get that row from accum
                    yield accum[0] # also remember that depth = 0
                else: # process the block and emit each row
                    depth = 0
                    to_emit = [] # prepare to emit the list
                    for i in range(len(accum)): # for each unique day in the accumulated list
                        date = accum[i][2] # define date to be the first date it sees
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j # define the depth
                            to_emit[-1][2] = date # define the date
                    for r in to_emit[0:]:
                        yield r
                accum = [row]
            prev = key

    df_data = dataset.reset_index()
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split(), ascending=[True,False])),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())

Testing:

t = pd.DataFrame(data={'id':[1,1,1,1,2,2,3,3,4,5], 'date':[20180311,20180310,20180210,20170505,20180312,20180311,20180312,20180311,20170501,20180304], 'feature':[10,20,45,1,14,15,20,20,13,11],'result':[1,1,0,0,0,0,1,0,1,1]})
t = t.reindex(columns=['id','date','feature','result'])
print t 
              id     date      feature      result
0              1  20180311          10           1
1              1  20180310          20           1
2              1  20180210          45           0
3              1  20170505           1           0
4              2  20180312          14           0
5              2  20180311          15           0
6              3  20180312          20           1
7              3  20180311          20           0
8              4  20170501          13           1
9              5  20180304          11           1

Output

                        depth     feature      result
id            date                                   
1             20180311      0          10           1
              20180311      1          20           1
              20180311      2          45           0
              20180311      3           1           0
              20180310      0          20           1
              20180310      1          45           0
              20180310      2           1           0
              20180210      0          45           0
              20180210      1           1           0
              20170505      0           1           0
2             20180312      0          14           0
              20180312      1          15           0
              20180311      0          15           0
3             20180312      0          20           1
              20180312      1          20           0
              20180311      0          20           0
4             20170501      0          13           1

