I'm working with a DataFrame in PySpark. I have a DataFrame df with a column col_1 that is an array of strings, and some of the elements are numbers.

Is there a built-in function to remove the numbers from this array?

Dataframe schema:

>>> df.printSchema()
root
 |-- col_1: array (nullable = true)
 |    |-- element: string (containsNull = true)

Sample Values in Column:

>>> df.select("col_1").show(2, truncate=False)

+-------------------------------------------------------------------------+
|col_1                                                                    |
+-------------------------------------------------------------------------+
|[use, bal, trans, ck, pay, billor, trans, cc, balances, got, grat, thnxs]|
|[hello, like, farther, lower, apr, 11, 49, thank]                        |
+-------------------------------------------------------------------------+

In this case, I'm looking for a function that would strip the numbers 11 and 49 from the second row. Thank you.

1 Answer

Here is something you can try:

# Data preparation
# (`sc` is the SparkContext that the PySpark shell provides)
data = [[['use', 'bal', 'trans', 'ck', 'pay', 'billor', 'trans', 'cc', 'balances', 'got', 'grat', 'thnxs']],
        [['hello', 'like', 'farther', 'lower', 'apr', '11', '49', 'thank']]]

df = sc.parallelize(data).toDF(["arr"])
df.printSchema()

Output:

root
 |-- arr: array (nullable = true)
 |    |-- element: string (containsNull = true)

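from pyspark.sql.functions import explode, regexp_extract, col

# Explode the array into one row per element, extract the first run of digits
# from each element (empty string when there is none), then keep the non-empty matches.
df.select(explode(df.arr).alias('elements'))\
  .select(regexp_extract('elements', r'\d+', 0)\
  .alias('Numbers'))\
  .filter(col('Numbers') != '').show()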

Output:

+-------+
|Numbers|
+-------+
|     11|
|     49|
+-------+
5 Comments

That worked. I was on vacation and couldn't test it earlier, but it works fine.
Quick follow-up question: my original ask was to strip the numbers and keep only the characters. I was looking for the syntax to keep all characters, so that my output is everything in the input string except the numbers. Do you know the syntax to define the full character set in regexp_extract?
The regex to match anything but digits would be \D+
OK, but I'm looking for the syntax for filtering the characters a to z.
To keep characters only: select(regexp_extract('elements', '[a-zA-Z]+', 0))
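
To compare the three patterns mentioned in these comments, here is a small plain-Python sketch that does not need Spark. The helper `first_match` is hypothetical: it only imitates how `regexp_extract` with group index 0 returns the first match, or an empty string when nothing matches. Spark uses Java regular expressions, but for patterns this simple Python's `re` module should behave the same way. The mixed value 'apr11' is an invented example, not part of the original data.

```python
import re

def first_match(pattern, text):
    """Return the first match of pattern in text, or '' if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else ''

elements = ['hello', 'like', 'farther', 'lower', 'apr', '11', '49', 'thank']

# \d+ picks out runs of digits, \D+ runs of non-digits, [a-zA-Z]+ runs of ASCII letters.
digits = [first_match(r'\d+', e) for e in elements]
non_digits = [first_match(r'\D+', e) for e in elements]
letters = [first_match(r'[a-zA-Z]+', e) for e in elements]

print(digits)      # ['', '', '', '', '', '11', '49', '']
print(non_digits)  # ['hello', 'like', 'farther', 'lower', 'apr', '', '', 'thank']
print(letters)     # ['hello', 'like', 'farther', 'lower', 'apr', '', '', 'thank']

# Hypothetical mixed element: only the first letter run is returned.
print(first_match(r'[a-zA-Z]+', 'apr11'))  # apr
```

Note that \D+ also matches punctuation and whitespace, while [a-zA-Z]+ is limited to unaccented Latin letters, so the two differ on inputs that contain anything beyond letters and digits.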
