
In PySpark, I am trying to clean a dataset. Some of the columns have unwanted characters (=" ") in their values. I read the dataset as a DataFrame and I have already created a User Defined Function which can remove the characters successfully, but now I am struggling to write a script which identifies on which columns I need to apply the UDF. I only use the last row of the dataset, assuming the columns always contain similar entries.

DataFrame (df):

      id  value1   value2   value3    
="100010"     10       20    ="30"

In Python, the following works:

columns_to_fix = []    
for col in df:
    value = df[col][0]
    if type(value) == str and value.startswith('='):
        columns_to_fix.append(col)   

I tried the following in PySpark, but this returns all the column names:

columns_to_fix = []    
for x in df.columns:
    if df[x].like('%="'):
        columns_to_fix.append(x)

Desired output:

columns_to_fix: ['id', 'value3']

Once I have the column names in a list, I can use a for loop to fix the entries in those columns. I am very new to PySpark, so my apologies if this is too basic a question. Thank you so much in advance for your advice!

  • df[x].like('%="') returns a Column expression, which is not None, so the test is always True. You need to collect() to check the actual content. Commented Sep 24, 2018 at 12:30
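A minimal sketch of the collect-based check this comment suggests, using the sample DataFrame from the question (the exact column contents are assumed from the example above):

    # Pull one row back to the driver so we can inspect real Python values
    # instead of unevaluated Column expressions.
    first_row = df.limit(1).collect()[0].asDict()

    columns_to_fix = [c for c, v in first_row.items()
                      if isinstance(v, str) and v.startswith('=')]
    print(columns_to_fix)  # e.g. ['id', 'value3'] for the sample data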

1 Answer


"I only use the last row of the dataset, assuming the columns always contains similar entries." Under that assumption, you could collect a single row and test if the character you are looking for is in there.

Also, note that you do not need a UDF to replace = in your columns; you can use regexp_replace. A working example is given below, hope this helps!

import pyspark.sql.functions as F

df = spark.createDataFrame([['=123','456','789'], ['=456','789','123']], ['a', 'b','c'])
df.show()

# +----+---+---+
# |   a|  b|  c|
# +----+---+---+
# |=123|456|789|
# |=456|789|123|
# +----+---+---+

# list all columns with '=' in it.
row = df.limit(1).collect()[0].asDict()
columns_to_replace = [i for i,j in row.items() if '=' in j]

for col in columns_to_replace:
    df = df.withColumn(col, F.regexp_replace(col, '=', ''))

df.show()

# +---+---+---+
# |  a|  b|  c|
# +---+---+---+
# |123|456|789|
# |456|789|123|
# +---+---+---+
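As a side note, the per-column loop can also be folded into a single select applied to the original DataFrame; a sketch under the same assumptions as the example above:

    # Rewrite only the columns flagged for cleaning; pass the rest through unchanged.
    df = df.select([
        F.regexp_replace(c, '=', '').alias(c) if c in columns_to_replace else F.col(c)
        for c in df.columns
    ])
    df.show()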

2 Comments

Nice solution! The 'regexp_replace' unfortunately doesn't work on Spark 1.3, but this should do it. Thanks!
Glad I could help. I have no experience with that version of PySpark, and your UDF probably works perfectly fine, but since you mention you are really new to PySpark it might be interesting to take a look at this for similar issues in the future.
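For reference, a minimal UDF-based sketch that avoids regexp_replace, in case it is unavailable on an older Spark version; the helper name strip_markers is made up here, and the exact API behaviour on Spark 1.3 is an assumption rather than something tested:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Strip the '=' and '"' markers; None values pass through unchanged.
    strip_markers = udf(lambda s: s.replace('=', '').replace('"', '') if s is not None else s,
                        StringType())

    for col in columns_to_replace:
        df = df.withColumn(col, strip_markers(df[col]))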
