
In pandas you can rename all columns in one go, in an "inplace" manner, using

new_column_name_list = ['Pre_' + x for x in df.columns]
df.columns = new_column_name_list

Can we do the same step in PySpark without having to create a new dataframe at the end? It seems inefficient because we would have two dataframes with the same data but different column names, leading to bad memory utilization.
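For comparison, a minimal PySpark sketch of the same prefixing (column names here are hypothetical stand-ins for `df.columns`). It builds `selectExpr` strings the way the pandas list comprehension builds names; note that `selectExpr` returns a new DataFrame object but does not duplicate the underlying data:

```python
# Hypothetical column names, standing in for df.columns
cols = ["age", "names"]

# Build "col AS Pre_col" expressions, mirroring the pandas rename above
exprs = [f"`{c}` AS `Pre_{c}`" for c in cols]

# df = df.selectExpr(*exprs)  # new DataFrame object; the data itself is not copied
```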

The link below answers the question, but not in place.

How to change dataframe column names in pyspark? EDIT: My question is clearly different from the question in the above link.

  • Please read my question again. I have clearly mentioned how the question is different from what I am asking. Commented Jun 15, 2017 at 9:23
  • The answers in the linked question seem to answer yours, e.g. data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age")) Commented Jun 15, 2017 at 9:24
  • No it doesn't, because a new dataframe is created Commented Jun 15, 2017 at 10:22
  • Aliasing creates a new DataFrame object, but it doesn't create a copy of the data. Unless you're worrying about local driver memory (in that case there is no good news for you) this is a duplicate. Commented Jun 15, 2017 at 11:36
  • This will do it: left_cols = df.columns; df = df.selectExpr([col + ' as left_' + col for col in left_cols]) Commented Mar 23, 2021 at 13:47

1 Answer


This is how you could do it in Scala Spark: build a mapping from the old column names to new ones dynamically, and select with alias.

import org.apache.spark.sql.functions.col

val to = df2.columns.map(col(_))
val from = (1 to to.length).map(i => s"column$i")

df2.select(to.zip(from).map { case (x, y) => x.alias(y) }: _*).show

Previous column names:

"age", "names"

After the change:

"column1", "column2"

However, a dataframe cannot be updated in place since dataframes are immutable, but the result can be assigned to a new variable (or back to the same one) for further use. Only the dataframes that are actually used get materialized in memory, so this won't be an issue.
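A PySpark sketch of the same rename, reusing the column names from the example above (in practice `old_cols` would be `df2.columns`); `toDF` returns a new DataFrame object, so the result must be reassigned, but no copy of the data is made:

```python
# Old names from the answer's example; in practice: old_cols = df2.columns
old_cols = ["age", "names"]

# Generate column1, column2, ... just like the Scala `from` sequence
new_cols = [f"column{i}" for i in range(1, len(old_cols) + 1)]

# df2 = df2.toDF(*new_cols)  # new DataFrame object, underlying data not duplicated
```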

Hope this helps


4 Comments

Based on the above code we cannot rename on the existing dataframe itself, right? We would have to finally say df3 = df2.select(to.zip(from).map { case (x, y) => x.alias(y) }: _*) to make the change permanent.
Will df2=df2.select(to.zip(from).map { case (x, y) => x.alias(y) }: _*) work?
This won't work because the Spark df is immutable?
Yes, it changes all the column names at once, but it does not change the original dataframe; it returns a new dataframe, since dataframes are immutable.
