
I need help parsing a string that contains a value for each attribute. Below is my sample string:

otherPartofString Name=<Series VR> Type=<1Ac4> SqVal=<34> conn ID=<2>

Sometimes the string can include other values with a different delimiter, like:

otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> conn ID=<2> conn Loc=sfo dest=chc bridge otherpartofString.. 

The expected output columns are:

Name      | Type | SqVal | ID | Loc  | dest 
-------------------------------------------
Series VR | 1Ac4 | 34    | 2  | null | null
Series X  | 1B3  | 34    | 2  | sfo  | chc 
  • Is the enclosing < > used only to contain values having SPACEs? Otherwise it will be hard to separate dest=chc from bridge otherpartofString; we will need either some pre-processing or post-processing on the last captured field. Commented Nov 5, 2020 at 20:00

1 Answer


As we discussed, to use the str_to_map function on your sample data, we can set pairDelim and keyValueDelim to the following:

pairDelim: '(?i)>? *(?=Name|Type|SqVal|conn ID|conn Loc|dest|$)'
keyValueDelim: '=<?'

Here, pairDelim is case-insensitive ((?i)) and matches an optional > followed by zero or more SPACEs, followed by a lookahead to one of the pre-defined keys (we use '|'.join(keys) to generate the alternation dynamically) or the end-of-string anchor $. keyValueDelim is an = with an optional <.
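As a sanity check outside Spark, the same two delimiters can be exercised with Python's re module. This is only an illustrative sketch of the split-then-split behavior; str_to_map would also keep stray fragments such as otherPartofString as map keys, which we filter out here:

```python
import re

# the pre-defined keys from the question
keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]

pair_delim = r'(?i)>? *(?={}|$)'.format('|'.join(keys))
key_value_delim = r'=<?'

s = "otherPartofString Name=<Series VR> Type=<1Ac4> SqVal=<34> conn ID=<2>"

# split into key=value fragments; drop empty pieces and tokens without '='
# (str_to_map itself would keep those as stray map keys with null values)
pairs = [p for p in re.split(pair_delim, s) if '=' in p]
m = dict(re.split(key_value_delim, p, maxsplit=1) for p in pairs)
print(m)  # {'Name': 'Series VR', 'Type': '1Ac4', 'SqVal': '34', 'conn ID': '2'}
```

The lookahead is what keeps each key name out of the preceding value: the split consumes only the closing > and the spaces, not the key itself.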

from pyspark.sql import functions as F

df = spark.createDataFrame([                                               
   ("otherPartofString Name=<Series VR> Type=<1Ac4> SqVal=<34> conn ID=<2>",),   
   ("otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> conn ID=<2> conn Loc=sfo dest=chc bridge otherpartofString..",)
],["value"])

keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]

# add the following conf for Spark 3.0 to overcome duplicate map key ERROR
#spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

df.withColumn("m", F.expr("str_to_map(value, '(?i)>? *(?={}|$)', '=<?')".format('|'.join(keys)))) \
    .select([F.col('m')[k].alias(k) for k in keys]) \
    .show()
+---------+----+-----+-------+--------+--------------------+
|     Name|Type|SqVal|conn ID|conn Loc|                dest|
+---------+----+-----+-------+--------+--------------------+
|Series VR|1Ac4|   34|      2|    null|                null|
| Series X| 1B3|   34|      2|     sfo|chc bridge otherp...|
+---------+----+-----+-------+--------+--------------------+

We will need to do some post-processing on the value of the last mapped key, since there is no anchor or pattern to distinguish it from the unrelated trailing text (this could be a problem, as it might happen for any key). Please let me know if you can specify such a pattern.

Edit: since using a map is less efficient for case-insensitive search (it requires some expensive pre-processing), try the following instead:

ptn = '|'.join(keys)
df.select("*", *[F.regexp_extract('value', r'(?i)\b{0}=<?([^=>]+?)>? *(?={1}|$)'.format(k,ptn), 1).alias(k) for k in keys]).show()
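The per-key pattern can likewise be checked with plain re, independent of Spark; this sketch mirrors what regexp_extract does for each column (group 1 is the value, terminated non-greedily by the next known key or the end of the string):

```python
import re

keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]
ptn = '|'.join(keys)

s = ("otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> conn ID=<2> "
     "conn Loc=sfo dest=chc bridge otherpartofString..")

def extract(key):
    # lazy value ([^=>]+?) stops as soon as the lookahead sees a known key or $
    m = re.search(r'(?i)\b{0}=<?([^=>]+?)>? *(?={1}|$)'.format(key, ptn), s)
    return m.group(1) if m else ''  # mimic regexp_extract's empty string on no match

print(extract('conn Loc'))  # sfo
print(extract('dest'))      # chc bridge otherpartofString..
```

Note the dest value still absorbs the trailing text, since nothing after it matches a key; this is the same post-processing problem mentioned above.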

If the angle brackets < and > are used only when a value or the next adjacent key contains non-word chars, this can be simplified with some pre-processing:

df.withColumn('value', F.regexp_replace('value', r'=(\w+)', '=<$1>')) \
    .select("*", *[F.regexp_extract('value', r'(?i)\b{0}=<([^>]+)>'.format(k), 1).alias(k) for k in keys]) \
    .show()
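The normalization step is easy to verify in plain Python; the only caveat is that re.sub uses \1 for the backreference where Spark's regexp_replace (Java-style) uses $1:

```python
import re

s = ("otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> conn ID=<2> "
     "conn Loc=sfo dest=chc bridge otherpartofString..")

# wrap bare word values in <...>; already-bracketed values stay untouched,
# because '<' is not a word character (Python backref \1, Spark would use $1)
normalized = re.sub(r'=(\w+)', r'=<\1>', s)

# now every value can be pulled with the same simple bracketed pattern
val = re.search(r'(?i)\bdest=<([^>]+)>', normalized).group(1)
print(val)  # chc
```

After this step the dest value is cleanly delimited, so the trailing "bridge otherpartofString.." no longer leaks into it.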

Edit-2: added a dictionary to handle key aliases:

keys = ["Name", "Type", "SqVal", "ID", "Loc", "dest"]

# aliases are case-insensitive and added only if they exist
key_aliases = {
    'Type': [ 'ThisType', 'AnyName' ],
    'ID': ['conn ID'],
    'Loc': ['conn Loc']
}

# set up regex pattern for each key differently
key_ptns = [ (k, '|'.join([k, *key_aliases[k]]) if k in key_aliases else k) for k in keys ]  
#[('Name', 'Name'),
# ('Type', 'Type|ThisType|AnyName'),
# ('SqVal', 'SqVal'),
# ('ID', 'ID|conn ID'),
# ('Loc', 'Loc|conn Loc'),
# ('dest', 'dest')]  

df.withColumn('value', F.regexp_replace('value', r'=(\w+)', '=<$1>')) \
    .select("*", *[F.regexp_extract('value', r'(?i)\b(?:{0})=<([^>]+)>'.format(p), 1).alias(k) for k,p in key_ptns]) \
    .show()
+--------------------+---------+----+-----+---+---+----+
|               value|     Name|Type|SqVal| ID|Loc|dest|
+--------------------+---------+----+-----+---+---+----+
|otherPartofString...|Series VR|1Ac4|   34|  2|   |    |
|otherPartofString...| Series X| 1B3|   34|  2|sfo| chc|
+--------------------+---------+----+-----+---+---+----+
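Here is a plain-Python sketch of the alias handling, using a hypothetical input where Type appears under its ThisType alias, to show that the per-key alternation picks up whichever spelling occurs:

```python
import re

keys = ["Name", "Type", "SqVal", "ID", "Loc", "dest"]
key_aliases = {'Type': ['ThisType', 'AnyName'], 'ID': ['conn ID'], 'Loc': ['conn Loc']}

# each key maps to an alternation of itself plus any aliases
key_ptns = [(k, '|'.join([k, *key_aliases[k]]) if k in key_aliases else k) for k in keys]

s = "Name=<Series X> ThisType=<1B3> SqVal=<34> conn ID=<2>"
s = re.sub(r'=(\w+)', r'=<\1>', s)  # same normalization step (Python backref \1)

row = {}
for k, p in key_ptns:
    m = re.search(r'(?i)\b(?:{})=<([^>]+)>'.format(p), s)
    row[k] = m.group(1) if m else ''
print(row)
```

The output column keeps the canonical key name (e.g. Type) even when the match came through an alias (ThisType).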

13 Comments

BTW, since you are using Spark 2.3 with case-insensitive search and have an issue with pandas_udf, it's probably a better choice to use regexp_extract to get the values of these keys, as shown in one of your previous questions.
Hi @jxc, sorry for the late reply. Suppose the string is 'Name=<Series VR> Location Type=<1Ac4>', i.e. it contains values outside the <=> form; then str_to_map fails. But the last solution you provided using regexp_extract is working as expected. Thanks again for the detailed explanation and for helping me out all the way. You are the best :)
@marc, glad it helps. The str_to_map function has limitations: there should not be any gap between pairs or between keys/values other than the two delimiters, so in your example you might have to set key=Location Type instead of Type. A map is also not efficient for case-insensitive search (it requires some pre- or post-processing).
BTW, can you upvote and accept my answer? Have a good night! :)
Sorry, forgot to accept. Thanks again and have a good night @jxc :)
