How to extract data from a column which has json type strings in pyspark?

Question

Hi, I have a dataframe:

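+---------+-----------------------------------------------------+
|client_id|event_metadata                                       |
+---------+-----------------------------------------------------+
|18890    |{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }  |
|10531    |{Scripname:"NAUKRI", Exchange: "NSE", Segment: "EQ" }|
+---------+-----------------------------------------------------+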

I want to extract Scripname from event_metadata and store it, along with client_id, in a new dataframe.

event_metadata is a string column, not valid JSON.

I have tried

from pyspark.sql import functions as F

df1.select('client_id', F.json_tuple('event_metadata', 'Scripname', 
 'Exchange','Segment').alias('Scripname',
  'Exchange','Segment')).show()

It returns null values.

I have also tried using a regex, but it raises an error:

from pyspark.sql.functions import regexp_extract

df1.withColumn("event_metadata", regexp_extract("event_metadata", "(? 
<=Scripname: )\w+(?=(,|}))", 0))\
 .show(truncate=False)

Desired Output:

+---------+---------+
|client_id|Scripname|
+---------+---------+
|18890    |DELL     |
|10531    |NAUKRI   |
+---------+---------+

Try from_json(); it requires a schema, however. spark.apache.org/docs/latest/api/python/pyspark.sql.html
– Smurphy0000, Commented Aug 14, 2020 at 3:31

Som · Accepted Answer · 2020-08-14 04:55:50Z

1

Try this, using regexp_extract (the snippet is Scala, not PySpark):

import org.apache.spark.sql.functions.{expr, regexp_extract}
import spark.implicits._  // enables the $"column" syntax

// Option 1: the regexp_extract column function
df2.withColumn("Scripname",
  regexp_extract($"event_metadata", "^\\{\\s*Scripname\\s*:\\s*\"(\\w+)\"", 1)
).show(false)

// Option 2: the same regex inside a SQL expression
df2.withColumn("Scripname",
  expr("""regexp_extract(event_metadata, '^\\{\\s*Scripname\\s*:\\s*"(\\w+)"', 1)""")
).show(false)

/**
  * +---------+-----------------------------------------------------+---------+
  * |client_id|event_metadata                                       |Scripname|
  * +---------+-----------------------------------------------------+---------+
  * |18890    |{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }  |DELL     |
  * |10531    |{Scripname:"NAUKRI", Exchange: "NSE", Segment: "EQ" }|NAUKRI   |
  * +---------+-----------------------------------------------------+---------+
  */
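
The answer above is written in Scala, while the question asks about PySpark. A rough PySpark port might look like the sketch below. It reuses the same regular expression; the helper name add_scripname and the constant SCRIPNAME_PATTERN are made up for illustration, and df1 is assumed to be the dataframe from the question. The pattern is also checked with Python's re module, which agrees with Java regex syntax for this particular expression.

```python
import re

# Same pattern as the Scala answer, written as a raw string so the
# backslashes reach the regex engine unchanged.
SCRIPNAME_PATTERN = r'^\{\s*Scripname\s*:\s*"(\w+)"'


def add_scripname(df):
    """Return client_id plus a Scripname column extracted from event_metadata."""
    # Imported here so the pattern can be checked without a Spark installation.
    from pyspark.sql import functions as F

    return df.select(
        "client_id",
        # Group 1 is the text between the quotes after "Scripname:".
        F.regexp_extract("event_metadata", SCRIPNAME_PATTERN, 1).alias("Scripname"),
    )


# Sanity check of the pattern on one row from the question, no Spark needed.
sample = '{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }'
print(re.search(SCRIPNAME_PATTERN, sample).group(1))  # DELL
```

With a live Spark session this would be called as add_scripname(df1).show(). Note that regexp_extract returns an empty string, not null, for rows where the pattern does not match.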

Daeho Ro · Accepted Answer · 2020-08-14 05:19:37Z

1

Define a schema for the column and parse it with from_json.

import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('Scripname', StringType(), True),
    StructField('Exchange', StringType(), True),
    StructField('Segment', StringType(), True),
])
df.withColumn('from_json', f.from_json('event_metadata', schema)) \
  .show(10, False)

+---------+-----------------------------------------------------------+-----------------+
|client_id|event_metadata                                             |from_json        |
+---------+-----------------------------------------------------------+-----------------+
|18890    |{"Scripname": "DELL", "Exchange": "NSE", "Segment": "EQ"}  |[DELL, NSE, EQ]  |
|10531    |{"Scripname": "NAUKRI", "Exchange": "NSE", "Segment": "EQ"}|[NAUKRI, NSE, EQ]|
+---------+-----------------------------------------------------------+-----------------+

The from_json column is now a struct, so you can select its fields with col('from_json.Scripname').

1 Comment

event_metadata is not actually valid JSON, so using from_json doesn't help here.
– Som
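
If the only thing wrong with event_metadata is the unquoted field names, the JSON reader's allowUnquotedFieldNames option may let from_json parse it after all. The following is a minimal sketch, assuming a Spark version whose from_json accepts a DDL schema string and an options dict. The helper name extract_scripname and the two constants are hypothetical, and the approach is untested against data with other irregularities.

```python
# DDL schema string for the three fields shown in the question.
SCHEMA_DDL = "Scripname STRING, Exchange STRING, Segment STRING"

# JSON reader option that tolerates unquoted field names
# such as {Scripname:"DELL"}.
JSON_OPTIONS = {"allowUnquotedFieldNames": "true"}


def extract_scripname(df):
    """Return client_id and Scripname, parsing event_metadata leniently."""
    # Imported here so the constants can be inspected without Spark installed.
    from pyspark.sql import functions as F

    # Parse the string column into a struct using the lenient option.
    parsed = F.from_json(F.col("event_metadata"), SCHEMA_DDL, JSON_OPTIONS)
    # Pull the single field of interest out of the struct.
    return df.select("client_id", parsed.getField("Scripname").alias("Scripname"))
```

It would be called as extract_scripname(df1).show(). Rows that still fail to parse should come back with a null Scripname, which makes them easy to filter out and inspect.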
