I have a PySpark dataframe and I would like to split its rows into columns based on the unique values of a given column, joining on the values of the other column. For illustrative purposes, let me use the following example, where my original dataframe is df.
df.show()
+-----+-----+
| col1| col2|
+-----+-----+
| z1| a1|
| z1| b2|
| z1| c3|
| x1| a1|
| x1| b2|
| x1| c3|
+-----+-----+
What I would like to do is to split on the unique values of col1, thus generating a new column (say, col3) by joining on the values of col2. The resulting dataframe that I am after would look like the following:
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
|   z1|   a1|   x1|
|   z1|   b2|   x1|
|   z1|   c3|   x1|
+-----+-----+-----+
This illustrative example only contains two unique values in col1 (i.e. z1 and x1). Ideally, I would like to write a piece of code which automatically detects the unique values in col1 and generates a corresponding new column for each. Does anyone know where I can start?
Edit: It is arbitrary which of z1 and x1 ends up in col1 and which in col3. It could definitely be the other way round, since I am simply interested in splitting by unique values.
Many thanks in advance,
Marioanzas