
In pandas you can rename all columns in one go, in an "inplace" manner, using

new_column_name_list = ['Pre_' + x for x in df.columns]
df.columns = new_column_name_list

Can we do the same step in PySpark without having to create a new dataframe at the end? It seems inefficient because we would have two dataframes with the same data but different column names, leading to bad memory utilization.
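For comparison, a minimal PySpark sketch of the same prefixing (column names here are hypothetical stand-ins for `df.columns`). It builds `selectExpr` strings the way the pandas list comprehension builds names; note that `selectExpr` returns a new DataFrame object but does not duplicate the underlying data:

```python
# Hypothetical column names, standing in for df.columns
cols = ["age", "names"]

# Build "col AS Pre_col" expressions, mirroring the pandas rename above
exprs = [f"`{c}` AS `Pre_{c}`" for c in cols]

# df = df.selectExpr(*exprs)  # new DataFrame object; the data itself is not copied
```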

The link below answers the question, but not in place.

How to change dataframe column names in pyspark? EDIT: My question is clearly different from the question in the above link.

  • Please read my question again. I have clearly mentioned how the question is different from what I am asking. Commented Jun 15, 2017 at 9:23
  • The answers in the linked question seem to answer yours, e.g. data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age")) Commented Jun 15, 2017 at 9:24
  • No it doesn't, because a new dataframe is created Commented Jun 15, 2017 at 10:22
  • Aliasing creates a new DataFrame object, but it doesn't create a copy of the data. Unless you're worrying about local driver memory (in that case there is no good news for you) this is a duplicate. Commented Jun 15, 2017 at 11:36
  • This will do it: left_cols = df.columns; df = df.selectExpr([col + ' as left_' + col for col in left_cols]) Commented Mar 23, 2021 at 13:47

1 Answer


This is how you could do it in Scala Spark: build a mapping from the old column names to new ones dynamically, and select with alias.

import org.apache.spark.sql.functions.col

val to = df2.columns.map(col(_))
val from = (1 to to.length).map(i => s"column$i")

df2.select(to.zip(from).map { case (x, y) => x.alias(y) }: _*).show

Previous column names:

"age", "names"

After the change:

"column1", "column2"

However, a dataframe cannot be updated in place since dataframes are immutable, but the result can be assigned to a new variable (or back to the same one) for further use. Only the dataframes that are actually used get materialized in memory, so this won't be an issue.
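A PySpark sketch of the same rename, reusing the column names from the example above (in practice `old_cols` would be `df2.columns`); `toDF` returns a new DataFrame object, so the result must be reassigned, but no copy of the data is made:

```python
# Old names from the answer's example; in practice: old_cols = df2.columns
old_cols = ["age", "names"]

# Generate column1, column2, ... just like the Scala `from` sequence
new_cols = [f"column{i}" for i in range(1, len(old_cols) + 1)]

# df2 = df2.toDF(*new_cols)  # new DataFrame object, underlying data not duplicated
```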

Hope this helps


4 Comments

Based on the above code we cannot rename on the existing dataframe itself, right? We would have to finally say df3 = df2.select(to.zip(from).map { case (x, y) => x.alias(y) }: _*) to make the change permanent.
Will df2=df2.select(to.zip(from).map { case (x, y) => x.alias(y) }: _*) work?
This won't work because the Spark df is immutable?
Yes, it changes all the column names at once, but it does not change the original dataframe; it returns a new dataframe, since dataframes are immutable.
