
I have a file with many JSON records in it. Each record contains a Struct ('Properties') and within each, a String that looks like this:

'meterDetails: "@{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}"'

Note that the values are not enclosed in quotes.

I want to treat this column (meterDetails) as another Struct in my DF, since all structs will eventually be flattened.

Proceeding with defining a schema, removing the @ with regexp_replace('col', '@', '') and using from_json with that schema resulted in a new column in JSON format, but with all NULL values.

Splitting the column with split(col("meterDetails"), ";") turns it into an Array, but upon conversion to JSON it is back to all NULL values.
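For reference, a stripped-down version of what I tried (the all-string schema here is just an assumption to keep the example minimal):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# All-string schema for the four fields (assumed for this example)
schema = StructType([StructField(f, StringType()) for f in
                     ["meterName", "meterCategory", "meterSubCategory", "unitOfMeasure"]])

# Attempt 1: strip the @ and parse with from_json -> all NULLs, because
# '{meterName=Read Operations; ...}' is still not valid JSON
# (unquoted keys/values, '=' instead of ':', ';' instead of ',')
df = df.withColumn("parsed",
                   F.from_json(F.regexp_replace("meterDetails", "@", ""), schema))

# Attempt 2: split on ';' -> an array of 'key=value' strings,
# which from_json cannot parse either
df = df.withColumn("parts", F.split(F.col("meterDetails"), ";"))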

Question:

I'm clearly misunderstanding the @{..} structure passed by this API. In Spark, should I convert this string to an object that natively results in a Struct?


2 Answers


Somehow I gravitate towards the str_to_map function, but then you need to transform the map into a struct. I'm not sure this is the best way, but I'd do it like this.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [("@{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}",)],
    ['meterDetails'])
df.show(truncate=0)
# +-----------------------------------------------------------------------------------------------------------------+
# |meterDetails                                                                                                     |
# +-----------------------------------------------------------------------------------------------------------------+
# |@{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}|
# +-----------------------------------------------------------------------------------------------------------------+

Script:

# Converting string to map: TRIM(BOTH '@{}' ...) strips any of the
# characters @, {, } from both ends; str_to_map then splits pairs
# on '; ' and keys from values on '='
df = df.withColumn(
    "meterDetails",
    F.expr("str_to_map(TRIM(BOTH '@{}' FROM meterDetails), '; ', '=')")
)
# Converting map to struct: serialize the map to JSON, infer a schema
# from the JSON strings, then parse them back with from_json
df = df.withColumn("meterDetails", F.to_json("meterDetails"))
json_schema = spark.read.json(df.rdd.map(lambda r: r.meterDetails)).schema
df = df.withColumn("meterDetails", F.from_json("meterDetails", json_schema))

df.show(truncate=0)
# +---------------------------------------------------------+
# |meterDetails                                             |
# +---------------------------------------------------------+
# |{Storage, Read Operations, General Block Blob, 100000000}|
# +---------------------------------------------------------+
df.printSchema()
# root
#  |-- meterDetails: struct (nullable = true)
#  |    |-- meterCategory: string (nullable = true)
#  |    |-- meterName: string (nullable = true)
#  |    |-- meterSubCategory: string (nullable = true)
#  |    |-- unitOfMeasure: string (nullable = true)
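
If the key names are known up front (they happen to be fixed in this example, but that is an assumption), you could skip the to_json/schema-inference round trip and build the struct straight from the map produced in the first step. A sketch:

# Build the struct directly from the map column (assumes fixed, known keys)
keys = ["meterName", "meterCategory", "meterSubCategory", "unitOfMeasure"]
df = df.withColumn(
    "meterDetails",
    F.struct(*[F.col("meterDetails").getItem(k).alias(k) for k in keys])
)

This also avoids the extra Spark job that spark.read.json triggers on the RDD to infer the schema.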

2 Comments

Cool usages of str_to_map and trim, very good to know!
Thank you. str_to_map is new to me. Makes sense now.

To transform your dataset into an operable structure, I would first use regexp_extract to get whatever is within @{ and }:

df1 = df1.withColumn("data", regexp_extract(col("string"), "\\{(.*)\\}", 1))

Now, data looks as below:

meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000

Next, I would split on "; " (including the space, so the keys don't end up with leading blanks) and transform your data into a named_struct:

df1 = df1.withColumn("data", split(col("data"), ";"))

df1 = df1.withColumn("data",
  expr("transform(data, x -> named_struct('key',split(x, \"=\")[0],'value',split(x, \"=\")[1]))")
)

Now, data looks as:

[{meterName, Read Operations}, {meterCategory, Storage}, {meterSubCategory, General Block Blob}, {unitOfMeasure, 100000000}]

where col("data")[0].key gives meterName, for example.

For more details, let's say you want another column with the keys only (note: key and value are hard-coded from the previous step):

df1 = df1.withColumn("otherData", expr("transform(data, x -> x.key)"))

The result:

[meterName, meterCategory, meterSubCategory, unitOfMeasure]
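
And if you ultimately want map/struct-style access as in the question, map_from_entries (available since Spark 2.4) can collapse this array of key/value structs into a single map. A sketch, reusing the data column from above:

# Collapse the array of {key, value} structs into one map column
df1 = df1.withColumn("dataMap", expr("map_from_entries(data)"))
# Values can then be read with col("dataMap")["meterName"], etc.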

I hope this is what you are looking for, good luck!
