
My DataFrame has a String column that contains results like the ones below.

  QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}   |

  QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}

I just want the value that follows "cbcnt" in allRows, for example the numeric part of this:

"cbcnt":0

Expected Output

col
----
0
2021-07-30

Tried:

.withColumn("CbRes",regexp_extract($"Col", """"cbcnt":(\S*\d+)""", 1)) 

Output

col
----
0
"2021-07-30 00:00:00   <-- an additional " is coming
  • There is nothing built into Spark to help you with this. You will have to write a transformation yourself, splitting the strings with regex and the like in plain Scala. Commented Sep 10, 2021 at 13:02
  • github.com/lauris/awesome-scala#parsing Commented Sep 10, 2021 at 14:02

2 Answers


Using the PySpark function regexp_extract:

from pyspark.sql import functions as F

# df is assumed to be a DataFrame with a string column "text" that contains the input data
# Capture group 1 holds the digits that directly follow "cbcnt":
df.withColumn("col", F.regexp_extract("text", r'"cbcnt":(\d+)', 1)).show()

7 Comments

Works well. Thanks.
What do I need to do when the value is a date, as in "cbcnt": "2021-07-30T00:00:00-04:00", instead of digits? "\d+" matches only digits. I want the date part, i.e. 2021-07-30.
@VnS you can try df.withColumn("col", F.regexp_extract("text", """"cbcnt":"(\d{4}-\d{2}-\d{2}).*".""", 1)).show()
This doesn't give the correct result. The column now becomes null. My column has both numeric and date content as strings. I want something that picks up whatever comes after cbcnt, either a number or a date.
@VnS I don't quite understand if you only want to get the date part or anything after cbcnt. Maybe you could create a new question with example input data and the expected output?
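
One way to cover both cases raised in these comments is a single pattern with an alternation: try the date form first, then fall back to plain digits. The sketch below uses Python's re module on shortened sample strings so that it runs without a Spark session. The helper name extract_cbcnt and the shortened inputs are illustrative and not part of the original post. Assuming Spark's regex engine handles this pattern the same way, it could be passed to F.regexp_extract in place of the \d+ version.

```python
import re

# Optional opening quote, then either a yyyy-MM-dd date or a run of digits.
# The date alternative comes first so "2021-07-30" is not cut down to "2021".
CBCNT_PATTERN = r'"cbcnt":"?(\d{4}-\d{2}-\d{2}|\d+)'


def extract_cbcnt(text):
    """Return the captured value, or "" when nothing matches (hypothetical helper)."""
    match = re.search(CBCNT_PATTERN, text)
    return match.group(1) if match else ""


# Shortened versions of the two rows in the question (assumed inputs).
numeric_row = 'QueryResult{allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}}'
date_row = ('QueryResult{allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], '
            'signature={"cbcnt":"String"}}')

print(extract_cbcnt(numeric_row))
print(extract_cbcnt(date_row))
```

The signature entry ("cbcnt":"number" or "cbcnt":"String") is never picked up, because the first match is in allRows and the pattern does not accept letters after the optional quote.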

Extract via regex:

val value = "QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{\"cbcnt\":0}], signature={\"cbcnt\":\"number\"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}   |"
val regex = """"cbcnt":(\d+)""".r.unanchored
val s"${regex(result)}" = value

println(result)

Output:

0

9 Comments

This is erroring out. The error is as below: method s is not a case class, nor does it have an unapply/unapplySeq member val s"${regex(result)}" = value
@vnsingh Then your version of Scala is older than 2.13. This was added starting with Scala 2.13.
Nevertheless, I believe that Werner's answer is more appropriate, since it is in the context of using Apache Spark.
What do I need to do when the value is a date, as in "cbcnt": "2021-07-30T00:00:00-04:00", instead of digits? "\d+" matches only digits. I want the date part, i.e. 2021-07-30.
allRows.*?cbcnt":(.*?)}
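
The pattern in the last comment captures everything between cbcnt": and the next }, so a string value comes back with its surrounding quotes and the full timestamp. A minimal sketch of that behaviour plus a clean-up step, written with Python's re module for illustration only (the helper name raw_and_clean and the shortened input strings are assumptions, not part of the thread):

```python
import re

# Pattern from the comment: lazily skip to the first cbcnt after allRows,
# then lazily capture everything up to the next closing brace.
LAZY_PATTERN = r'allRows.*?cbcnt":(.*?)}'


def raw_and_clean(text):
    """Return (raw capture, cleaned value); hypothetical helper for illustration."""
    match = re.search(LAZY_PATTERN, text)
    if match is None:
        return ("", "")
    raw = match.group(1)
    # Drop surrounding quotes, then keep only the part before the "T" time separator.
    clean = raw.strip('"').split("T")[0]
    return (raw, clean)


# Shortened versions of the two rows in the question (assumed inputs).
numeric_row = 'allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}'
date_row = 'allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}'

print(raw_and_clean(numeric_row))
print(raw_and_clean(date_row))
```

The clean-up step is what turns the raw capture into the expected output from the question; without it the date row keeps its quotes and time portion.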