Hi, I have a DataFrame:
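+---------+-----------------------------------------------------+
|client_id|event_metadata                                       |
+---------+-----------------------------------------------------+
|18890    |{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }  |
|10531    |{Scripname:"NAUKRI", Exchange: "NSE", Segment: "EQ" }|
+---------+-----------------------------------------------------+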
I want to parse event_metadata and keep only Scripname, together with client_id, in a new DataFrame.
event_metadata is a string column, and its content is not valid JSON (the keys are unquoted).
I have tried:
from pyspark.sql import functions as F

df1.select(
    'client_id',
    F.json_tuple('event_metadata', 'Scripname', 'Exchange', 'Segment')
     .alias('Scripname', 'Exchange', 'Segment'),
).show()
It returns null values.
I have also tried a regex, but it raises an error:
from pyspark.sql.functions import regexp_extract

df1.withColumn(
    "event_metadata",
    regexp_extract("event_metadata", r"(?<=Scripname: )\w+(?=(,|}))", 0),
).show(truncate=False)
Desired Output:
+---------+---------+
|client_id|Scripname|
+---------+---------+
|18890    |DELL     |
|10531    |NAUKRI   |
+---------+---------+
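For what it's worth, in the sample rows the value follows the colon directly and is wrapped in double quotes, so the lookbehind `(?<=Scripname: )` followed by `\w+` in the regexp_extract attempt above would not match them. A minimal sketch of an alternative, assuming every row looks like the two shown: capture the quoted value in a group and ask regexp_extract for group 1. The names `SCRIPNAME_PATTERN` and `with_scripname` are made up for this illustration.

```python
import re

# Hypothetical pattern: the key, optional whitespace, then the quoted value.
SCRIPNAME_PATTERN = r'Scripname:\s*"([^"]*)"'


def with_scripname(df):
    """Return client_id plus the Scripname captured from event_metadata."""
    # Imported here so the pattern can be checked without a Spark installation.
    from pyspark.sql import functions as F

    return df.select(
        "client_id",
        # Group 1 is the text between the quotes; no match gives an empty string.
        F.regexp_extract("event_metadata", SCRIPNAME_PATTERN, 1).alias("Scripname"),
    )


# Sanity check of the pattern on the sample rows with Python's re module.
samples = [
    '{Scripname:"DELL", Exchange: "NSE", Segment: "EQ" }',
    '{Scripname:"NAUKRI", Exchange: "NSE", Segment: "EQ" }',
]
extracted = [re.search(SCRIPNAME_PATTERN, s).group(1) for s in samples]
print(extracted)  # ['DELL', 'NAUKRI']
```

With the DataFrame from the question this would be called as `with_scripname(df1).show()`. If some rows put a space after the colon, the `\s*` in the pattern already allows for that.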
However, from_json() requires a schema. See spark.apache.org/docs/latest/api/python/pyspark.sql.html
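A rough sketch of what the from_json() route could look like, under two assumptions that are not stated in the question: that the three keys in the sample rows are the only ones needed, and that the unquoted keys are the only thing keeping the strings from parsing as JSON. The schema is given as a DDL string, and the JSON reader option `allowUnquotedFieldNames` is passed through from_json()'s options argument. `METADATA_SCHEMA`, `JSON_OPTIONS` and `parse_scripname` are names invented for this example.

```python
# Assumed schema: the three keys seen in the sample rows, all as strings.
METADATA_SCHEMA = "Scripname STRING, Exchange STRING, Segment STRING"

# JSON reader option that lets the parser accept keys without quotes.
JSON_OPTIONS = {"allowUnquotedFieldNames": "true"}


def parse_scripname(df):
    """Parse event_metadata into a struct and keep client_id and Scripname."""
    # Imported here so the module can be loaded without a Spark installation.
    from pyspark.sql import functions as F

    parsed = F.from_json("event_metadata", METADATA_SCHEMA, JSON_OPTIONS)
    return df.select("client_id", parsed["Scripname"].alias("Scripname"))
```

Usage would be `parse_scripname(df1).show()`. Rows that still fail to parse come back as null in this setup, so it is worth checking for nulls before relying on the result.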