
I have a pyspark dataframe similar to the following:

from pyspark.sql import Row

df = sql_context.createDataFrame([
    Row(a=3, b=[4, 5, 6], c=[10, 11, 12], d='bar', e='utf friendly'),
    Row(a=2, b=[1, 2, 3], c=[7, 8, 9], d='foo', e=u'ab\u0000the')
])

One of the values in column e contains the UTF null character \u0000. If I try to load this df into a PostgreSQL database, I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0x00 

which makes sense. How can I efficiently remove the null character from the pyspark dataframe before loading the data into postgres?

I have tried using some of the pyspark.sql.functions to clean the data first, without success. encode, decode, and regexp_replace did not work:

from pyspark.sql.functions import col, decode, encode, regexp_replace

df.select(regexp_replace(col('e'), u'\u0000', ''))
df.select(encode(col('e'), 'UTF-8'))
df.select(decode(col('e'), 'UTF-8'))

Ideally, I would like to clean the entire dataframe without specifying exactly which columns or what the violating character is, since I don't necessarily know this information ahead of time.

I am using a postgres 9.4.9 database with UTF8 encoding.

2 Answers


Ah wait - I think I have it. If I do something like this, it seems to work:

null = u'\u0000'
new_df = df.withColumn('e', regexp_replace(df['e'], null, ''))

And then applying it to all of the string columns:

string_columns = ['d', 'e']
new_df = df.select(*(
    regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
    for c in df.columns
))
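
If you don't know ahead of time which columns are strings, one option is to derive them from the schema instead of listing them by hand. This is a minimal sketch (not part of the original answer) that uses df.dtypes to find the string columns and applies the same regexp_replace to each:

    from pyspark.sql.functions import col, regexp_replace

    # Derive the string columns from the schema rather than hard-coding them.
    string_columns = [name for name, dtype in df.dtypes if dtype == 'string']

    null = u'\u0000'
    clean_df = df.select(*(
        regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
        for c in df.columns
    ))

Since the pattern argument to regexp_replace is a regular expression, you could also widen it to a character class (for example, a class of control characters) if you don't know the exact offending character in advance.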



You can use DataFrame.fillna() to replace null values.

Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.

Parameters:

  • value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.

  • subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
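
For reference, a minimal sketch of how fillna() is used with these parameters (the replacement values here are illustrative):

    # Fill SQL NULLs in every compatible column with one value...
    df.fillna(0)

    # ...or per column, with a dict mapping column name to replacement value
    # ('a' and 'd' are columns from the example df in the question).
    df.fillna({'a': 0, 'd': 'unknown'})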

1 Comment

I don't think that works here, because the problem cell isn't actually null - it contains the UTF null character \u0000. If I run df.fillna() on my example df, it looks like it returns the same dataframe, since none of the cells are actually null. If I try to load the resulting df into a postgres table, I still get the same error message.
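
A quick way to confirm this, assuming the example df from the question (a sketch, not from the original comment):

    from pyspark.sql.functions import col, regexp_replace

    # fillna() only replaces SQL NULLs, so the \u0000 character survives it.
    df.fillna('').filter(col('e').contains(u'\u0000')).count()  # returns 1

    # regexp_replace() actually removes the character.
    df.select(regexp_replace(col('e'), u'\u0000', '').alias('e')) \
      .filter(col('e').contains(u'\u0000')).count()             # returns 0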
