1

I have a dataframe like below.

id  | js                                    |
0   | bla var test bla ..                   |
1   | bla function RAM blob                 |
2   | function CPU blob blob                |
3   | thanks                                |
4   | bla var AWS and function twitter blaa |

I am trying to extract the next word after function or var

My code is here.

pattern3 = "(func)\s+(\w+)|(var)\s+(\w+)"

df = df.withColumn("js_extracted2", f.regexp_extract(f.col("js"),pattern3,4))

as it is capture only one word, the final row returns only AWS and not Twitter.

So I would like to capture all matching.

My spark version is less than 3,

so I tried df.withColumn('output', f.expr("regexp_extract_all(js, '(func)\s+(\w+)|(var)\s+(\w+)', 4)")).show()

but it returns only empty for all rows.

my expected output is

id  | js                                    | output
0   | bla var test bla ..                   | [test]
1   | bla function RAM blob                 | [RAM]
2   | function CPU blob blob                | [CPU]
3   | thanks                                | 
4   | bla var AWS and function twitter blaa | [AWS, twitter]
1
  • can you try with my input, I tried but not giving expected output Commented Mar 7, 2022 at 10:16

3 Answers 3

3

You need to use four \ to form a regular expression.

df = df.withColumn("js_extracted2", F.expr(f"regexp_extract_all(js, '(function|var)\\\\s+(\\\\w+)', 2)"))
df.show(truncate=False)
Sign up to request clarification or add additional context in comments.

5 Comments

Alternatively, you can make your string a raw string and you'll only need two `\` characters. More info: docs.python.org/3/library/re.html
@AutomatedChaos Yes, Thanks. Another way is to adjust the ``spark.sql.parser.escapedStringLiterals``` parameter to true.
can you help in escaping this regex to work with regex_extract_all (?:[\w$]+?\.)(?:[\w$]+?\.?)+
Can you give complete requirements, including input and output samples.
hi @过过招 , actually this solution works fine on my local but in my server the spark version is 2.4.7. May I know how can I do it on lower version.. error: pyspark.sql.utils.AnalysisException: "Undefined function: 'regexp_extract_all'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0"
2

use a positive look behind ascertion (?<=i)j. That is extract j only if i is immediately to its left

df.withColumn('name', regexp_extract('js', '((?<=function|var)(\s\w+))',1)).show()

---+--------------------+-----+
| id|                  js| name|
+---+--------------------+-----+
|  0| bla var test bla ..| test|
|  1| bla function RAM...|  RAM|
|  2|function CPU blob...|  CPU|
|  3|             thanks |     |
|  4|bla var AWS and f...|  AWS|
+---+--------------------+-----+

1 Comment

actually the approved solution works fine on my local but in my server the spark version is 2.4.7. May I know how can I do it on lower version.. error: pyspark.sql.utils.AnalysisException: "Undefined function: 'regexp_extract_all'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0"
2

The question looks more like how to do this without using regexp_extract_all to me.

Here is another way without regex, just incase one has a problem of using regexp_extract_all:

noofwordstoextract = 1
df.withColumn("ArrayOfWords",F.split("js"," "))\
.withColumn("test",F.expr(f"""filter(transform(ArrayOfWords,(x,e)-> 
CASE WHEN x in ('var','function') 
THEN array_join(slice(ArrayOfWords,e+2,{noofwordstoextract}),' ') ELSE NULL END)
,y-> y is not NULL)""")).drop("ArrayOfWords").show()


+---+-------------------------------------+--------------+
|id |js                                   |test          |
+---+-------------------------------------+--------------+
|0  |bla var test bla ..                  |[test]        |
|1  |bla function RAM blob                |[RAM]         |
|2  |function CPU blob blob               |[CPU]         |
|3  |thanks                               |[]            |
|4  |bla var AWS and function twitter blaa|[AWS, twitter]|
+---+-------------------------------------+--------------+

This solution splits the strings into an array then checks the current element is in 'var' or 'function' and then extracts the next n words (here 1) and then joins them to form the original string (try with noofwordstoextract=2) for more details.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.