Memory Error: Create multiple columns based on multiple column conditions from another dataframe

Question

As mentioned in the post below: Create multiple columns based on multiple column conditions from another dataframe

I was able to get the required output. However, when I run my script on large files I get a memory error. Is there a way to overcome this memory error while keeping the solution from the post above? If not, what would be the best way to achieve the result without running into a memory error?

Adding the full details again:

I have two dataframes derived from CSV files, df1 and df2.

df1

 |BID    |Datetime           |TrId |Code|LineNumber|Vol  |Grade      |PId
0|1002867|2019-08-19 01:27:53|1459 |f   |10        |33.88|Vd         |4  
1|1002867|2019-08-19 01:39:05|1460 |f   |10        |18.13|EE         |5  
2|1002867|2019-08-19 01:39:55|1461 |f   |10        |21.8 |Ad         |9  
3|1002867|2019-08-19 01:39:55|1461 |f   |20        |500  |Vd         |10 
4|1002147|2019-08-19 01:26:21|2764 |f   |10        |33.86|V9         |3  
5|1002147|2019-10-19 01:31:57|2765 |f   |10        |3.48 |V9         |2  
9|1001257|2019-08-19 01:49:54|11524|f   |10        |19.93|Ul         |5

df2

 |sId  |BID    |StartDateTime      |EndDateTime        
0|10007|1002867|2019-07-26 05:11:05|2019-10-05 21:50:55
1|10006|1002147|2019-08-18 05:11:05|2019-10-05 21:50:55
2|10006|1002147|2019-10-05 21:50:55|2019-11-06 21:50:28
3|10006|1002147|2019-10-06 21:50:28|2019-10-08 03:56:20
4|10006|1002147|2019-10-08 03:56:20|2019-10-09 03:50:35
5|10006|1002147|2019-10-09 03:50:35|2019-10-10 05:12:30
6|10006|1002147|2019-10-10 05:12:30|2019-10-11 05:12:38
7|10009|1002348|2019-09-26 04:21:12|2019-10-06 04:16:00
8|10009|1002348|2019-10-06 04:16:00|2019-10-07 04:11:38
9|10009|1002348|2019-10-07 04:11:38|2019-10-08 04:13:12

Note that the two dataframes are not the same length.

I want to add the columns sId, StartDateTime and EndDateTime from df2 to df1 only where the following condition holds:

df1.BID == df2.BID and df1.Datetime is between df2.StartDateTime and df2.EndDateTime

My result should look like this:

 |BID    |Datetime           |TrId |Code|LineNumber|Vol  |Grade      |PId|sId  |StartDateTime      |EndDateTime        
0|1002867|2019-08-19 01:27:53|1459 |f   |10        |33.88|Vd         |4  |10007|2019-07-26 05:11:05|2019-10-05 21:50:55
1|1002867|2019-08-19 01:39:05|1460 |f   |10        |18.13|EE         |5  |10007|2019-07-26 05:11:05|2019-10-05 21:50:55
2|1002867|2019-08-19 01:39:55|1461 |f   |10        |21.8 |Ad         |9  |10007|2019-07-26 05:11:05|2019-10-05 21:50:55
3|1002867|2019-08-19 01:39:55|1461 |f   |20        |500  |Vd         |10 |10007|2019-07-26 05:11:05|2019-10-05 21:50:55
4|1002147|2019-08-19 01:26:21|2764 |f   |10        |33.86|V9         |3  |10006|2019-08-18 05:11:05|2019-10-05 21:50:55
5|1002147|2019-10-19 01:31:57|2765 |f   |10        |3.48 |V9         |2  |10006|2019-10-05 21:50:55|2019-11-06 21:50:28
9|1001257|2019-08-19 01:49:54|11524|f   |10        |19.93|Ul         |5  |NA   |NA                 |NA

I have tried the method from this post: Create column based on multiple column conditions from another dataframe

However, I get only the site ID (sId) in my result, not StartDateTime and EndDateTime. How can I get these columns in my result?

Tried code:

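import pandas as pd

# For each sId, flag the df1 rows whose BID matches and whose Datetime
# falls inside any of that sId's [StartDateTime, EndDateTime] intervals.
for key, grp in df2.groupby('sId'):
    cols = ['BID', 'StartDateTime', 'EndDateTime']
    masks = (df1['BID'].eq(bid) & df1['Datetime'].between(start, end) for bid, start, end in grp[cols].itertuples(index=False))
    df1.loc[pd.concat(masks, axis=1).any(axis=1), 'sId'] = key

df1['sId'] = df1['sId'].fillna('NA')
print(df1)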

This prints out only:

 |BID    |Datetime           |TrId |Code|LineNumber|Vol  |Grade      |PId|sId  
0|1002867|2019-08-19 01:27:53|1459 |f   |10        |33.88|Vd         |4  |10007
1|1002867|2019-08-19 01:39:05|1460 |f   |10        |18.13|EE         |5  |10007
2|1002867|2019-08-19 01:39:55|1461 |f   |10        |21.8 |Ad         |9  |10007
3|1002867|2019-08-19 01:39:55|1461 |f   |20        |500  |Vd         |10 |10007
4|1002147|2019-08-19 01:26:21|2764 |f   |10        |33.86|V9         |3  |10006
5|1002147|2019-10-19 01:31:57|2765 |f   |10        |3.48 |V9         |2  |10006
9|1001257|2019-08-19 01:49:54|11524|f   |10        |19.93|Ul         |5  |NA
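The loop above only ever writes to sId, which is why the other two columns never appear. A minimal sketch of one way to carry StartDateTime and EndDateTime along as well, assuming the date columns have already been parsed with pd.to_datetime. The function name tag_intervals is made up for illustration, and if intervals for the same BID overlap, the last matching df2 row wins:

```python
import pandas as pd

def tag_intervals(df1, df2):
    """Copy sId, StartDateTime and EndDateTime onto matching df1 rows."""
    # Start with empty columns so unmatched rows stay NaN / NaT.
    out = df1.assign(sId=float("nan"), StartDateTime=pd.NaT, EndDateTime=pd.NaT)
    for row in df2.itertuples(index=False):
        # Same condition as in the question: equal BID and Datetime in the interval.
        mask = out["BID"].eq(row.BID) & out["Datetime"].between(
            row.StartDateTime, row.EndDateTime
        )
        out.loc[mask, "sId"] = row.sId
        out.loc[mask, "StartDateTime"] = row.StartDateTime
        out.loc[mask, "EndDateTime"] = row.EndDateTime
    return out
```

This makes one pass over df1 per row of df2, so it is slow when df2 is large, but it never builds a joined table.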

As mentioned, the following works with a small data set:

df3 = pd.merge(df1, df2, on='BID', how="left")
result = df3[df3['Datetime'].between(df3.StartDateTime, df3.EndDateTime) | df3.sId.isna()]

But running this on large files raises a MemoryError.
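The merge on BID pairs every df1 row with every df2 interval for that BID before any filtering happens, so the intermediate df3 can be far larger than either input. One way to keep the same approach but bound its memory use is to run it on slices of df1 and concatenate the filtered pieces. A minimal sketch, assuming the date columns have already been parsed with pd.to_datetime; the function name merge_in_chunks and the chunk_size default are made up for illustration:

```python
import pandas as pd

def merge_in_chunks(df1, df2, chunk_size=100_000):
    """Apply the merge-then-filter step to df1 one slice at a time."""
    pieces = []
    for start in range(0, len(df1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        # Only this slice is joined against df2, so the intermediate stays small.
        merged = chunk.merge(df2, on="BID", how="left")
        in_range = merged["Datetime"].between(
            merged["StartDateTime"], merged["EndDateTime"]
        )
        # Same filter as above: keep interval matches and BIDs absent from df2.
        pieces.append(merged[in_range | merged["sId"].isna()])
    return pd.concat(pieces, ignore_index=True)
```

Like the one-shot version, this drops df1 rows whose BID exists in df2 but whose Datetime falls in none of its intervals. A smaller chunk_size lowers peak memory at the cost of more merge calls.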

Can you share the data again, ideally all of it? Please also include all relevant code. See: minimal reproducible example.
– AMC, commented Nov 26, 2019 at 6:13

pybeginner · Accepted Answer · 2019-11-27 01:09:41Z

I installed 64-bit Python, and that resolved the issue.
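A 32-bit interpreter can only address a few gigabytes of memory no matter how much RAM the machine has, which is why switching to a 64-bit build can make a MemoryError go away. A small sketch for checking which build is running, using only the standard library; the helper name interpreter_bits is made up for illustration:

```python
import struct
import sys

def interpreter_bits():
    """Return the pointer size of the running interpreter in bits."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8

print(f"{interpreter_bits()}-bit Python, sys.maxsize = {sys.maxsize}")
```

On a 64-bit build sys.maxsize is larger than 2**32, which gives a second way to confirm the result.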

