
I have a dataframe of the following format. I want to add empty rows for missing time stamps for each customer.

+-------------+----------+------+----+----+
| Customer_ID | TimeSlot |  A1  | A2 | An |
+-------------+----------+------+----+----+
| c1          |        1 | 10.0 |  2 |  3 |
| c1          |        2 | 11   |  2 |  4 |
| c1          |        4 | 12   |  3 |  5 |
| c2          |        2 | 13   |  2 |  7 |
| c2          |        3 | 11   |  2 |  2 |
+-------------+----------+------+----+----+

The resulting table should be of the format

+-------------+----------+------+------+------+
| Customer_ID | TimeSlot |  A1  |  A2  |  An  |
+-------------+----------+------+------+------+
| c1          |        1 | 10.0 | 2    | 3    |
| c1          |        2 | 11   | 2    | 4    |
| c1          |        3 | null | null | null |
| c1          |        4 | 12   | 3    | 5    |
| c2          |        1 | null | null | null |
| c2          |        2 | 13   | 2    | 7    |
| c2          |        3 | 11   | 2    | 2    |
| c2          |        4 | null | null | null |
+-------------+----------+------+------+------+

I have 1 million customers and 360 time slots (the example above shows only 4). One approach I found is to create a dataframe with two columns (Customer_ID, TimeSlot) holding all 1 M x 360 combinations, and then do a left outer join with the original dataframe.

Is there a better way to do this?

2 Answers

You can express this as a SQL query:

select c.customerid, t.timeslot,
       df.A1, df.A2, df.An
from (select distinct customerid from df) c cross join
     (select distinct timeslot from df) t left join
     df
     on df.customerid = c.customerid and df.timeslot = t.timeslot;

Notes:

  • You should probably put this into another dataframe.
  • You might have tables with the available customers and/or timeslots. Use those instead of the subqueries.
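
As a minimal sketch, the query can be tried against the sample data from the question using Python's built-in sqlite3 module. The in-memory table and the choice of SQLite are assumptions for illustration only (the question does not say which engine backs the dataframe). The select list takes customerid from c, timeslot from t, and the value columns from df, so the filled-in rows keep their keys and get nulls for A1, A2, An.

```python
import sqlite3

# In-memory SQLite table standing in for the real dataframe (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table df (customerid text, timeslot integer, "
    "A1 real, A2 integer, An integer)"
)
conn.executemany(
    "insert into df values (?, ?, ?, ?, ?)",
    [
        ("c1", 1, 10.0, 2, 3),
        ("c1", 2, 11.0, 2, 4),
        ("c1", 4, 12.0, 3, 5),
        ("c2", 2, 13.0, 2, 7),
        ("c2", 3, 11.0, 2, 2),
    ],
)

# Every customer x every timeslot, then left join back to the data.
query = """
select c.customerid, t.timeslot, df.A1, df.A2, df.An
from (select distinct customerid from df) c
cross join (select distinct timeslot from df) t
left join df
  on df.customerid = c.customerid and df.timeslot = t.timeslot
order by c.customerid, t.timeslot
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Missing combinations such as (c1, 3) come back with None in the value columns, which corresponds to the null rows in the desired output.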

I think you can use Gordon Linoff's answer, but since you said there are millions of customers and you are joining across them, consider the following additions.

Use a tally table for TimeSlot, because it might give better performance. For more on how to use one, see the following link:

http://www.sqlservercentral.com/articles/T-SQL/62867/

I also think you should use a partition or row-number function to split the customerid column into batches and select customers based on some partition value. For example, select a range of row-number values and then cross join those customers with the tally table. This can improve your performance.
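
A rough sketch of that idea, again using Python's sqlite3 module on the sample data. The table name tally, the helper fill_batch, and the batch boundaries are all hypothetical names chosen for illustration, and row_number() assumes SQLite 3.25 or newer. Whether batching actually helps will depend on the engine and data, so treat this as something to measure rather than a guaranteed win.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table df (customerid text, timeslot integer, "
    "A1 real, A2 integer, An integer)"
)
conn.executemany(
    "insert into df values (?, ?, ?, ?, ?)",
    [
        ("c1", 1, 10.0, 2, 3),
        ("c1", 2, 11.0, 2, 4),
        ("c1", 4, 12.0, 3, 5),
        ("c2", 2, 13.0, 2, 7),
        ("c2", 3, 11.0, 2, 2),
    ],
)

# Tally table: one row per possible timeslot (4 here, 360 in the question).
N_SLOTS = 4
conn.execute("create table tally (timeslot integer primary key)")
conn.executemany(
    "insert into tally values (?)", [(n,) for n in range(1, N_SLOTS + 1)]
)

# Number the distinct customers, then expand only one range of them at a time.
BATCH_QUERY = """
with numbered as (
    select d.customerid,
           row_number() over (order by d.customerid) as rn
    from (select distinct customerid from df) d
)
select n.customerid, t.timeslot, df.A1, df.A2, df.An
from numbered n
cross join tally t
left join df
  on df.customerid = n.customerid and df.timeslot = t.timeslot
where n.rn between ? and ?
order by n.customerid, t.timeslot
"""


def fill_batch(first_rn, last_rn):
    """Return the gap-filled rows for customers numbered first_rn..last_rn."""
    return conn.execute(BATCH_QUERY, (first_rn, last_rn)).fetchall()


batch1 = fill_batch(1, 1)  # first customer only
batch2 = fill_batch(2, 2)  # second customer only
for row in batch1 + batch2:
    print(row)
```

Because the tally table lists every slot independently of the data, a slot that no customer ever used would still appear in the result, which the distinct-timeslot subquery cannot guarantee.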
