I have an existing PySpark dataframe that has around 200 columns. I have a list of the column names (in the correct order and length).
How can I apply the list to the dataframe without using StructType?
Assuming the list of column names is in the right order and has a matching length, you can use toDF.
Preparing an example dataframe
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(np.random.randint(1,10,(5,4)).tolist(), list('ABCD'))
df.show()
Output
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 6| 9| 4| 7|
| 6| 4| 7| 9|
| 2| 5| 2| 2|
| 3| 7| 4| 5|
| 8| 9| 6| 8|
+---+---+---+---+
Changing the column names
newcolumns = ['new_A','new_B','new_C','new_D']
df.toDF(*newcolumns).show()
Output
+-----+-----+-----+-----+
|new_A|new_B|new_C|new_D|
+-----+-----+-----+-----+
| 6| 9| 4| 7|
| 6| 4| 7| 9|
| 2| 5| 2| 2|
| 3| 7| 4| 5|
| 8| 9| 6| 8|
+-----+-----+-----+-----+
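Since toDF matches names to columns purely by position, it is worth checking the list length up front for a wide frame like yours; a mismatch raises an error only at rename time. A small sketch of that check — the four-name list here is a stand-in for your real 200-name list:

```python
# Stand-ins for the real lists; replace with df.columns and your 200 names
old_names = ['A', 'B', 'C', 'D']
new_names = ['new_' + name for name in old_names]

# toDF assigns names positionally, so the lengths must match exactly
assert len(new_names) == len(old_names), "name list does not match column count"

# Then apply it in one call:
# renamed_df = df.toDF(*new_names)
```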
If you have a pre-existing list of column names, it works the same way:
df_list = ["newName_1", "newName_2", "newName_3", "newName_4"]
renamed_df = df.toDF(*df_list)
renamed_df.show()
But if you want to build the new names dynamically instead of maintaining a hand-written list, you can derive them from df.columns, for example by adding a prefix:
from pyspark.sql.functions import col
df.select([col(c).alias('new_' + c) for c in df.columns])
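For completeness: when only a handful of the 200 columns actually need new names, withColumnRenamed lets you leave the rest untouched. A sketch using a hypothetical old-name to new-name mapping; each call returns a new DataFrame, so the loop reassigns:

```python
# Hypothetical mapping: only the columns listed here are renamed
rename_map = {'A': 'id', 'B': 'score'}

# Applied to a DataFrame like the example above:
# for old, new in rename_map.items():
#     df = df.withColumnRenamed(old, new)

# Columns not present in the mapping keep their original names.
```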