Spark dataframe add Missing Values

Question

I have a dataframe of the following format. I want to add empty rows for missing time stamps for each customer.

+-------------+----------+------+----+----+
| Customer_ID | TimeSlot |  A1  | A2 | An |
+-------------+----------+------+----+----+
| c1          |        1 | 10.0 |  2 |  3 |
| c1          |        2 | 11   |  2 |  4 |
| c1          |        4 | 12   |  3 |  5 |
| c2          |        2 | 13   |  2 |  7 |
| c2          |        3 | 11   |  2 |  2 |
+-------------+----------+------+----+----+

The resulting table should be of the format

+-------------+----------+------+------+------+
| Customer_ID | TimeSlot |  A1  |  A2  |  An  |
+-------------+----------+------+------+------+
| c1          |        1 | 10.0 | 2    | 3    |
| c1          |        2 | 11   | 2    | 4    |
| c1          |        3 | null | null | null |
| c1          |        4 | 12   | 3    | 5    |
| c2          |        1 | null | null | null |
| c2          |        2 | 13   | 2    | 7    |
| c2          |        3 | 11   | 2    | 2    |
| c2          |        4 | null | null | null |
+-------------+----------+------+------+------+

I have 1 million customers and 360 time slots (only 4 are shown in the example above). One approach I found is to create a DataFrame with two columns (Customer_ID, TimeSlot) holding every combination (1 million x 360 rows) and do a left outer join with the original DataFrame.

Is there a better way to do this?

Gordon Linoff · Accepted Answer · 2016-12-13 12:21:57Z

4

You can express this as a SQL query:

select c.customerid, t.timeslot,
       df.A1, df.A2, df.An
from (select distinct customerid from df) c cross join
     (select distinct timeslot from df) t left join
     df
     on df.customerid = c.customerid and df.timeslot = t.timeslot;

Notes:

You should probably put this into another dataframe.
You might have tables with the available customers and/or timeslots. Use those instead of the subqueries.
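
As a quick check of this query shape, the sketch below runs it on the question's sample data. It uses Python's built-in sqlite3 module rather than Spark, an assumption made only to keep the example self-contained; with Spark you would register the DataFrame as a temporary view and pass similar SQL to spark.sql. The key columns are taken from the c and t subqueries and the value columns from df, so the gap rows keep their customer and time slot. Table and column names follow the query above.

```python
import sqlite3

# Sample rows from the question: (customerid, timeslot, A1, A2, An).
rows = [
    ("c1", 1, 10.0, 2, 3),
    ("c1", 2, 11.0, 2, 4),
    ("c1", 4, 12.0, 3, 5),
    ("c2", 2, 13.0, 2, 7),
    ("c2", 3, 11.0, 2, 2),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table df (customerid text, timeslot integer, "
    "A1 real, A2 integer, An integer)"
)
conn.executemany("insert into df values (?, ?, ?, ?, ?)", rows)

# Build every (customer, timeslot) pair, then left join the data onto it.
query = """
select c.customerid, t.timeslot, df.A1, df.A2, df.An
from (select distinct customerid from df) c
cross join (select distinct timeslot from df) t
left join df
  on df.customerid = c.customerid and df.timeslot = t.timeslot
order by c.customerid, t.timeslot
"""

filled = conn.execute(query).fetchall()
for row in filled:
    print(row)  # missing slots show None in the value columns
```

The order by is only there to make the output easy to compare with the expected table in the question.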

Rohit Gupta · Accepted Answer · 2016-12-13 12:44:06Z

0

I think you can use Gordon Linoff's answer, but since you said there are millions of customers and you are joining across them, you can add the following things.

Use a tally table for TimeSlot, because it might give better performance. For more on how to use one, please refer to the following link:

http://www.sqlservercentral.com/articles/T-SQL/62867/

I also think you should use a partition or row-number function to divide the customerid column into groups and select customers based on a partition value. For example, select the customers in one range of row numbers and then cross join them with the tally table. This can improve your performance.
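
The tally-table idea can be sketched as follows, again with Python's sqlite3 module standing in for Spark so the example is self-contained. The slots table and the N_SLOTS constant are hypothetical names, not something from the question. The sample data deliberately has no row at slot 3 for any customer, a gap that a "select distinct timeslot" over the data would miss. In Spark, the same list of slots could come from spark.range(1, 361).

```python
import sqlite3

N_SLOTS = 4  # hypothetical constant; the question has 360 slots

conn = sqlite3.connect(":memory:")
conn.execute("create table df (customerid text, timeslot integer, A1 real)")
conn.executemany(
    "insert into df values (?, ?, ?)",
    [("c1", 1, 10.0), ("c1", 4, 12.0), ("c2", 2, 13.0)],  # nobody has slot 3
)

# Tally table: one row per slot, independent of what the data contains.
conn.execute("create table slots (timeslot integer primary key)")
conn.executemany(
    "insert into slots values (?)",
    [(i,) for i in range(1, N_SLOTS + 1)],
)

filled = conn.execute("""
select c.customerid, s.timeslot, df.A1
from (select distinct customerid from df) c
cross join slots s
left join df
  on df.customerid = c.customerid and df.timeslot = s.timeslot
order by c.customerid, s.timeslot
""").fetchall()

# Every slot appears for every customer, including the one absent from df.
slots_seen = sorted({slot for _, slot, _ in filled})
print(slots_seen)
```

Whether this is faster than deriving the slots from the data depends on the engine and data size; the sketch only shows that the result no longer depends on which slots happen to be present.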
