
I am using PySpark in Databricks with Spark 3.1.

I need to extract numbers from a text column in a DataFrame using the regexp_extract_all function.

Approach 1:

email_df11 = spark.sql("select New_id, regexp_extract_all(subject,'(?<!^DT!\\d)([D|d][T|t]\\d{12}|\\d{9,29})(?!\\d)', 1) as num_subject  from email_view")

This results in a num_subject column containing only empty lists; no matches are returned.

However, when I run the query below directly against a view of the same data frame, I am able to see the output.

Approach 2:

select New_id, regexp_extract_all(subject,'same regex as above', 1) as num_subject from email_view 


What do I need to change in Approach 1 in order to get a similar result?

1 Answer


You need to use four backslashes (\\\\) to escape each regex backslash when passing the query through spark.sql:

email_df11 = spark.sql("select New_id, regexp_extract_all(subject,'(?<!^DT!\\\\d)([D|d][T|t]\\\\d{12}|\\\\d{9,29})(?!\\\\d)', 1) as num_subject  from email_view")
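The doubling is needed because the pattern goes through two layers of string processing: Python's string-literal parsing and then Spark SQL's string-literal parsing, each of which consumes one level of backslashes. As a quick illustration (not part of the original answer), printing the literals shows what actually reaches spark.sql; the SQL parser then consumes one more level before the regex engine runs:

# What Python passes to spark.sql for each way of writing the escape.
print("\\\\d")   # prints \\d -> the regex engine ultimately sees \d (digit class)
print(r"\\d")    # prints \\d -> same string, written as a raw literal
print("\\d")     # prints \d  -> the digit escape does not survive, so digits are not matched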

Or use a Python raw string for the query:

email_df11 = spark.sql(r"select New_id, regexp_extract_all(subject,'(?<!^DT!\\d)([D|d][T|t]\\d{12}|\\d{9,29})(?!\\d)', 1) as num_subject  from email_view")
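If you prefer to avoid building the whole query as a string, a minimal sketch of the same extraction through the DataFrame API is shown below. It assumes the view was created from a DataFrame called email_df (a hypothetical name) with columns New_id and subject; F.expr still goes through the SQL expression parser, so the raw-string form with the double backslashes is kept:

from pyspark.sql import functions as F

# Same extraction via the DataFrame API; expr() parses the SQL expression,
# so the regex keeps the SQL-level \\ escapes via a Python raw string.
email_df11 = email_df.select(
    "New_id",
    F.expr(r"regexp_extract_all(subject, '(?<!^DT!\\d)([D|d][T|t]\\d{12}|\\d{9,29})(?!\\d)', 1)").alias("num_subject"),
)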