
I have the following dataframe. There are several IDs that have either a numeric or a string value. If an ID has a string value, like "B", its numeric_value is "NULL" as a string, and vice versa for numeric IDs such as "D".

    ID  string_value    numeric_value   timestamp
0   B   On              NULL            1632733508
1   B   Off             NULL            1632733508
2   A   Inactive        NULL            1632733511
3   A   Active          NULL            1632733512
4   D   NULL            450             1632733513
5   D   NULL            431             1632733515
6   C   NULL            20              1632733518
7   C   NULL            30              1632733521

Now I want to separate the dataframe into a new one for each ID, using an existing list that contains all the unique IDs. Afterwards, each new dataframe, like "B" in this example, should drop the column holding the "NULL" values. So if B has a string_value, the numeric_value column should be dropped.

    ID  string_value    timestamp
0   B   On              1632733508
1   B   Off             1632733508

After that, the value column should be renamed to the ID ("B") and the ID column should be dropped.

    B   timestamp
0   On  1632733508
1   Off 1632733508

As described, the same procedure should be applied to the numeric values, in this case ID "D".

    ID  numeric_value   timestamp
0   D   450             1632733513
1   D   431             1632733515

    D   timestamp
0   450 1632733513
1   431 1632733515

It is important to preserve the original data types within the value column.

1 Answer

Assuming your dataframe is called df and your list of IDs is ids, you can write a function that does what you need and call it for every ID.

The function applies the required filter and then selects the needed columns, aliasing the value column with the ID.

from pyspark.sql import functions as f

ids = ["B", "A", "D", "C"]


def split_df(df, id):
    # coalesce picks whichever value column is non-null for this ID
    return df.filter(f.col("ID") == id).select(
        f.coalesce(f.col("string_value"), f.col("numeric_value")).alias(id),
        f.col("timestamp"),
    )


dfs = [split_df(df, id) for id in ids]

for df in dfs:
    df.show()

Output:

+---+----------+                                                                
|  B| timestamp|
+---+----------+
| On|1632733508|
|Off|1632733508|
+---+----------+

+--------+----------+
|       A| timestamp|
+--------+----------+
|Inactive|1632733511|
|  Active|1632733512|
+--------+----------+

+---+----------+
|  D| timestamp|
+---+----------+
|450|1632733513|
|431|1632733515|
+---+----------+

+---+----------+
|  C| timestamp|
+---+----------+
| 20|1632733518|
| 30|1632733521|
+---+----------+

5 Comments

Thanks for the answer! Which library does the variable 'f' come from?
The code works fine so far. However, when I output the schemas of the dfs elements, the original data types have been lost and everything is a "string". Why is this happening?
@Horseman added the import. The coalesce between a string and an int returns a string.
@Steven thank you! Is there a workaround to preserve the types?
@Horseman ask ScootCork. That's his solution!
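
As the comments note, coalesce between a string column and an int column casts the result to string. A possible workaround, sketched below (not part of the accepted answer; `value_column_for` and `split_df_keep_type` are hypothetical names), is to decide per ID which column actually carries the data and select only that column, so its original type survives:

```python
def value_column_for(row):
    """Pick the populated value column from a row of an ID's slice.

    `row` is any mapping with 'string_value' and 'numeric_value' keys,
    e.g. a pyspark Row converted via .asDict().
    """
    return "string_value" if row["string_value"] is not None else "numeric_value"


def split_df_keep_type(df, id):
    # `f` is pyspark.sql.functions, as imported in the answer above.
    from pyspark.sql import functions as f

    sub = df.filter(f.col("ID") == id)
    first = sub.select("string_value", "numeric_value").first().asDict()
    # Select only the populated column; no coalesce, so no cast to string.
    return sub.select(f.col(value_column_for(first)).alias(id), f.col("timestamp"))
```

Because only one column is selected per ID, the numeric IDs keep their numeric type; calling `printSchema()` on each resulting dataframe should confirm the original types.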
