
I have the following dataframe. There are several IDs that have either a numeric or a string value. If an ID has a string value, like "B", its numeric_value is "NULL" as a string, and vice versa for numeric IDs such as "D".

    ID  string_value    numeric_value   timestamp
0   B   On              NULL            1632733508
1   B   Off             NULL            1632733508
2   A   Inactive        NULL            1632733511
3   A   Active          NULL            1632733512
4   D   NULL            450             1632733513
5   D   NULL            431             1632733515
6   C   NULL            20              1632733518
7   C   NULL            30              1632733521

Now I want to separate the dataframe into a new one for each ID, using an existing list that contains all the unique IDs. Afterwards, each new dataframe, like "B" in this example, should drop the column holding the "NULL" values. So if B has a string_value, the numeric_value column should be dropped.

    ID  string_value    timestamp
0   B   On              1632733508
1   B   Off             1632733508

After that, the value column should be renamed to the ID ("B") and the ID column should be dropped.

    B   timestamp
0   On  1632733508
1   Off 1632733508

As described, the same procedure should be applied to the numeric values, in this case ID "D".

    ID  numeric_value   timestamp
0   D   450             1632733513
1   D   431             1632733515

    D   timestamp
0   450 1632733513
1   431 1632733515

It is important to preserve the original data types within the value column.

1 Answer

Assuming your dataframe is called df and your list of IDs is ids, you can write a function that does what you need and call it for every ID.

The function applies the required filter and then selects the needed columns, aliasing the value column with the ID.

from pyspark.sql import functions as f

ids = ["B", "A", "D", "C"]


def split_df(df, id):
    # coalesce picks whichever value column is non-null for this ID
    return df.filter(f.col("ID") == id).select(
        f.coalesce(f.col("string_value"), f.col("numeric_value")).alias(id),
        f.col("timestamp"),
    )


dfs = [split_df(df, id) for id in ids]

for df in dfs:
    df.show()

Output:

+---+----------+                                                                
|  B| timestamp|
+---+----------+
| On|1632733508|
|Off|1632733508|
+---+----------+

+--------+----------+
|       A| timestamp|
+--------+----------+
|Inactive|1632733511|
|  Active|1632733512|
+--------+----------+

+---+----------+
|  D| timestamp|
+---+----------+
|450|1632733513|
|431|1632733515|
+---+----------+

+---+----------+
|  C| timestamp|
+---+----------+
| 20|1632733518|
| 30|1632733521|
+---+----------+

5 Comments

Thanks for the answer! Which library does the variable 'f' come from?
The code works fine so far. However, when I output the schemas of the dfs elements, the original data types have been lost and everything is a "string". Why is this happening?
@Horseman added the import. The coalesce between a string and an int returns a string.
@Steven thank you! Is there a workaround to preserve the types?
@Horseman ask ScootCork. That's his solution!
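
As the comments note, coalesce between a string column and an int column casts the result to string. A possible workaround, sketched below (not part of the accepted answer; `value_column_for` and `split_df_keep_type` are hypothetical names), is to decide per ID which column actually carries the data and select only that column, so its original type survives:

```python
def value_column_for(row):
    """Pick the populated value column from a row of an ID's slice.

    `row` is any mapping with 'string_value' and 'numeric_value' keys,
    e.g. a pyspark Row converted via .asDict().
    """
    return "string_value" if row["string_value"] is not None else "numeric_value"


def split_df_keep_type(df, id):
    # `f` is pyspark.sql.functions, as imported in the answer above.
    from pyspark.sql import functions as f

    sub = df.filter(f.col("ID") == id)
    first = sub.select("string_value", "numeric_value").first().asDict()
    # Select only the populated column; no coalesce, so no cast to string.
    return sub.select(f.col(value_column_for(first)).alias(id), f.col("timestamp"))
```

Because only one column is selected per ID, the numeric IDs keep their numeric type; calling `printSchema()` on each resulting dataframe should confirm the original types.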
