3

I have an input Spark DataFrame named df:

+---------------+----------------+-----------------------+
|Main_CustomerID|126+ Concentrate|2.5 Ethylhexyl_Acrylate|
+---------------+----------------+-----------------------+
|         725153|             3.0|                    2.0|
|         873008|             4.0|                    1.0|
|         625109|             1.0|                    0.0|
+---------------+----------------+-----------------------+

I need to remove the special characters from the column names of df as follows:

  • Remove +

  • Replace space with an underscore

  • Replace dot with an underscore

So my df should look like this:

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|         725153|            3.0|                    2.0|
|         873008|            4.0|                    1.0|
|         625109|            1.0|                    0.0|
+---------------+---------------+-----------------------+

Using Scala, I have achieved this with:

var tableWithColumnsRenamed = df

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}

df = tableWithColumnsRenamed

When I used,

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}

I got the first column name as 126 Concentrate instead of 126_Concentrate.

But I would prefer not to use three for loops for this replacement. Is there a better solution?
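
Presumably the chained version fails because, once a rename actually changes a column's name, the later withColumnRenamed calls in the same iteration still look up the original field name, which no longer exists, so they are silently skipped. A minimal single-loop sketch (illustrative, not from the original post) that builds the cleaned name first and renames once:

var renamed = df
for (field <- renamed.columns) {
  // Build the cleaned name first, then rename a single time,
  // so the lookup name is always the one that currently exists
  val cleaned = field
    .replaceAll("\\.", "_")  // dot -> underscore
    .replaceAll("\\+", "")   // drop +
    .replaceAll(" ", "_")    // space -> underscore
  renamed = renamed.withColumnRenamed(field, cleaned)
}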

5 Answers

11
// Fold over the column names, renaming each one: spaces and dots become underscores
df
  .columns
  .foldLeft(df){(newdf, colname) =>
    newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_"))
  }
  .show

1 Comment

Yes, it works perfectly. For my use case, I changed the solution to df.columns.foldLeft(df){(newdf, colname) => newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_").replace("+", ""))}
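
Formatted out for readability, that variant (the same logic as the comment above, just spread over lines) looks like this:

df
  .columns
  .foldLeft(df) { (newdf, colname) =>
    // Additionally drop "+", as the question requires
    newdf.withColumnRenamed(
      colname,
      colname.replace(" ", "_").replace(".", "_").replace("+", "")
    )
  }
  .show
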
8

You can use withColumnRenamed together with regex replaceAllIn and foldLeft, as below:

val columns = df.columns

// Collapse any run of +, ., _, comma or space into a single underscore
val regex = """[+._, ]+"""
val replacingColumns = columns.map(regex.r.replaceAllIn(_, "_"))

// Pair each new name with its old name and rename one column at a time
val resultDF = replacingColumns.zip(columns).foldLeft(df) {
  (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)
}

resultDF.show(false)

which should give you

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|725153         |3.0            |2.0                    |
|873008         |4.0            |1.0                    |
|625109         |1.0            |0.0                    |
+---------------+---------------+-----------------------+

I hope the answer is helpful
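
As a side note, since every column gets a new name, the same mapping can also be applied in a single call with toDF, which replaces all column names at once (a sketch, not part of the original answer):

val regex = """[+._, ]+"""
// toDF takes the complete list of new column names, in the existing column order
val resultDF = df.toDF(df.columns.map(regex.r.replaceAllIn(_, "_")): _*)
resultDF.show(false)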


0

In Java you can iterate over the column names using df.columns() and fix each header string with String.replaceAll(regexPattern, intendedReplacement).

Then use withColumnRenamed(headerName, correctedHeaderName) to rename the DataFrame header.

For example:

for (String headerName : dataset.columns()) {
    // replaceAll takes a regex, so "+" must be escaped; drop it, and turn spaces and dots into "_"
    String correctedHeaderName = headerName
            .replaceAll(" ", "_")
            .replaceAll("\\.", "_")
            .replaceAll("\\+", "");
    dataset = dataset.withColumnRenamed(headerName, correctedHeaderName);
}
dataset.show();


0

Piggybacking on Ramesh's answer, here is a reusable function that uses currying syntax with the .transform() method and also lower-cases the column names:

import org.apache.spark.sql.DataFrame

// Replace every run of characters matching regex_string with "_" and lower-case all column names
def formatAllColumns(regex_string: String)(df: DataFrame): DataFrame = {
  val replacingColumns = df.columns.map(regex_string.r.replaceAllIn(_, "_"))
  val resultDF: DataFrame = replacingColumns.zip(df.columns).foldLeft(df){
    (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1.toLowerCase())
  }
  resultDF
}
val resultDF = df.transform(formatAllColumns(regex_string="""[+._(), ]+"""))


0

We can rename all the columns in a single statement by mapping each column name to a new name, using replaceAll for each special character. This one-liner is tried and tested with Spark/Scala; the backticks around the original name let select refer to columns whose names contain dots or spaces.

import org.apache.spark.sql.functions.col

df.select(
  df.columns
    .map(colName => col(s"`${colName}`").as(colName.replaceAll("\\.", "_").replaceAll(" ", "_"))): _*
).show(false)
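
The same idea, extended to also drop + as the question requires (a sketch in the spirit of the answer above, not the original code):

df.select(
  df.columns.map { colName =>
    // Backticks let us reference the original column name even when it contains dots, spaces or +
    col(s"`${colName}`").as(
      colName.replaceAll("\\.", "_").replaceAll("\\+", "").replaceAll(" ", "_")
    )
  }: _*
).show(false)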

1 Comment

While this code may solve the problem, the answer would be a lot better with an explanation of how and why it works. Remember that your answer is not just for the user who asked the question, but also for all the other people who find it.
