
In PySpark, I am trying to clean a dataset. Some of the columns have unwanted characters (=" ") in their values. I read the dataset as a DataFrame and I have already created a User Defined Function which can remove the characters successfully, but now I am struggling to write a script which identifies on which columns I need to apply the UDF. I only use the last row of the dataset, assuming the columns always contain similar entries.

DataFrame (df):

      id  value1   value2   value3    
="100010"     10       20    ="30"

In Python, the following works:

columns_to_fix = []    
for col in df:
    value = df[col][0]
    if type(value) == str and value.startswith('='):
        columns_to_fix.append(col)   

I tried the following in PySpark, but this returns all the column names:

columns_to_fix = []    
for x in df.columns:
    if df[x].like('%="'):
        columns_to_fix.append(x)

Desired output:

columns_to_fix: ['id', 'value3']

Once I have the column names in a list, I can use a for loop to fix the entries in those columns. I am very new to PySpark, so my apologies if this is too basic a question. Thank you so much in advance for your advice!

  • df[x].like('%="') returns a Column expression, which is not None, so the test is always True. You need to collect() to check the actual content. Commented Sep 24, 2018 at 12:30
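A minimal sketch of the collect-based check this comment suggests, using the sample DataFrame from the question (the exact column contents are assumed from the example above):

    # Pull one row back to the driver so we can inspect real Python values
    # instead of unevaluated Column expressions.
    first_row = df.limit(1).collect()[0].asDict()

    columns_to_fix = [c for c, v in first_row.items()
                      if isinstance(v, str) and v.startswith('=')]
    print(columns_to_fix)  # e.g. ['id', 'value3'] for the sample data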

1 Answer


"I only use the last row of the dataset, assuming the columns always contains similar entries." Under that assumption, you could collect a single row and test if the character you are looking for is in there.

Also, note that you do not need a UDF to replace = in your columns; you can use regexp_replace. A working example is given below, hope this helps!

import pyspark.sql.functions as F

df = spark.createDataFrame([['=123','456','789'], ['=456','789','123']], ['a', 'b','c'])
df.show()

# +----+---+---+
# |   a|  b|  c|
# +----+---+---+
# |=123|456|789|
# |=456|789|123|
# +----+---+---+

# list all columns with '=' in it.
row = df.limit(1).collect()[0].asDict()
columns_to_replace = [i for i,j in row.items() if '=' in j]

for col in columns_to_replace:
    df = df.withColumn(col, F.regexp_replace(col, '=', ''))

df.show()

# +---+---+---+
# |  a|  b|  c|
# +---+---+---+
# |123|456|789|
# |456|789|123|
# +---+---+---+
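As a side note, the per-column loop can also be folded into a single select applied to the original DataFrame; a sketch under the same assumptions as the example above:

    # Rewrite only the columns flagged for cleaning; pass the rest through unchanged.
    df = df.select([
        F.regexp_replace(c, '=', '').alias(c) if c in columns_to_replace else F.col(c)
        for c in df.columns
    ])
    df.show()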

2 Comments

Nice solution! The 'regexp_replace' unfortunately doesn't work on Spark 1.3, but this should do it. Thanks!
Glad I could help. I have no experience with that version of PySpark, and your UDF probably works perfectly fine, but since you mention you are really new to PySpark it might be interesting to take a look at this for similar issues in the future.
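For reference, a minimal UDF-based sketch that avoids regexp_replace, in case it is unavailable on an older Spark version; the helper name strip_markers is made up here, and the exact API behaviour on Spark 1.3 is an assumption rather than something tested:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Strip the '=' and '"' markers; None values pass through unchanged.
    strip_markers = udf(lambda s: s.replace('=', '').replace('"', '') if s is not None else s,
                        StringType())

    for col in columns_to_replace:
        df = df.withColumn(col, strip_markers(df[col]))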
