Hi, I have a DataFrame:

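+---------+-----------------------------------------------------+
|client_id|event_metadata                                       |
+---------+-----------------------------------------------------+
|18890    |{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }  |
|10531    |{Scripname:"NAUKRI", Exchange: "NSE", Segment: "EQ" }|
+---------+-----------------------------------------------------+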

I want to extract only Scripname from event_metadata and keep it alongside client_id in a new DataFrame.

event_metadata is a string column, not JSON.

I have tried:

from pyspark.sql import functions as F

df1.select('client_id',
           F.json_tuple('event_metadata', 'Scripname', 'Exchange', 'Segment')
            .alias('Scripname', 'Exchange', 'Segment')).show()

It returns null values.

I have also tried a regex, but it raises an error:

from pyspark.sql.functions import regexp_extract

df1.withColumn("event_metadata", regexp_extract("event_metadata", "(? 
<=Scripname: )\w+(?=(,|}))", 0))\
 .show(truncate=False)

Desired Output:

+---------+---------+
|client_id|Scripname|
+---------+---------+
|18890    |DELL     |
|10531    |NAUKRI   |
+---------+---------+

2 Answers

Try regexp_extract (Scala):

import org.apache.spark.sql.functions.{expr, regexp_extract}
import spark.implicits._  // enables the $"column" syntax; assumes a SparkSession named spark

// Column API: capture group 1 is the quoted value that follows Scripname
df2.withColumn("Scripname",
  regexp_extract($"event_metadata", "^\\{\\s*Scripname\\s*:\\s*\"(\\w+)\"", 1)
).show(false)

// The same extraction written as a SQL expression
df2.withColumn("Scripname",
  expr("""regexp_extract(event_metadata, '^\\{\\s*Scripname\\s*:\\s*"(\\w+)"', 1)""")
).show(false)

/**
  * +---------+-----------------------------------------------------+---------+
  * |client_id|event_metadata                                       |Scripname|
  * +---------+-----------------------------------------------------+---------+
  * |18890    |{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }  |DELL     |
  * |10531    |{Scripname:"NAUKRI", Exchange: "NSE", Segment: "EQ" }|NAUKRI   |
  * +---------+-----------------------------------------------------+---------+
  */
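
Since the question uses PySpark, here is a rough Python equivalent of the snippet above. It is only a sketch: the helper name with_scripname and the constant SCRIPNAME_PATTERN are made up for illustration, and the Spark import sits inside the function so the pattern can be checked with Python's re module without a Spark installation.

```python
import re

# Same pattern as in the answer above; a raw string avoids doubling the backslashes.
SCRIPNAME_PATTERN = r'^\{\s*Scripname\s*:\s*"(\w+)"'


def with_scripname(df):
    """Return client_id plus the Scripname captured from event_metadata."""
    # Imported here so the pattern above can be checked without Spark installed.
    from pyspark.sql import functions as F

    return df.select(
        "client_id",
        # Group 1 is the value between the quotes; non-matching rows yield "".
        F.regexp_extract("event_metadata", SCRIPNAME_PATTERN, 1).alias("Scripname"),
    )


# Quick check of the pattern on a row from the question.
sample = '{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }'
match = re.search(SCRIPNAME_PATTERN, sample)
print(match.group(1) if match else None)
```

With the question's DataFrame this would be called as with_scripname(df1).show(). Note that the pattern is anchored at the opening brace, so it only matches when Scripname is the first key in the string.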

Define the schema explicitly and parse the column with from_json:

import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('Scripname', StringType(), True),
    StructField('Exchange', StringType(), True),
    StructField('Segment', StringType(), True),
])
df.withColumn('from_json', f.from_json('event_metadata', schema)) \
  .show(10, False)

+---------+-----------------------------------------------------------+-----------------+
|client_id|event_metadata                                             |from_json        |
+---------+-----------------------------------------------------------+-----------------+
|18890    |{"Scripname": "DELL", "Exchange": "NSE", "Segment": "EQ"}  |[DELL, NSE, EQ]  |
|10531    |{"Scripname": "NAUKRI", "Exchange": "NSE", "Segment": "EQ"}|[NAUKRI, NSE, EQ]|
+---------+-----------------------------------------------------------+-----------------+

The from_json column is now a struct, so you can select its fields with, for example, col('from_json.Scripname').

1 Comment

event_metadata is not actually valid JSON, so using from_json doesn't help.
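
One possible way around the objection in this comment, offered as an untested sketch rather than a confirmed fix: Spark's JSON reader has an allowUnquotedFieldNames option, and from_json accepts an options dictionary, so lenient parsing may handle the unquoted keys. The names METADATA_SCHEMA, is_strict_json and with_scripname_from_json are hypothetical, and whether the option behaves this way on a given Spark version is an assumption.

```python
import json

# Hypothetical DDL schema string for the three keys shown in the question.
METADATA_SCHEMA = "Scripname STRING, Exchange STRING, Segment STRING"


def is_strict_json(text):
    """Return True if text parses as standard JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def with_scripname_from_json(df):
    """Parse event_metadata leniently and keep client_id and Scripname."""
    # Imported here so is_strict_json can be run without Spark installed.
    from pyspark.sql import functions as F

    parsed = F.from_json(
        "event_metadata",
        METADATA_SCHEMA,
        # Assumption: this JSON reader option is honoured by from_json.
        {"allowUnquotedFieldNames": "true"},
    )
    return df.select("client_id", parsed.getField("Scripname").alias("Scripname"))


# The question's rows have unquoted keys, so a strict parser rejects them.
print(is_strict_json('{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }'))
```

If the option is not honoured in your environment, the regexp_extract approach does not depend on the string being parseable at all.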
