Pythonic way of extracting a desired substring from a bigger string

Question

I have a bytes string like this:

msg = b'@\x06string\x083http://schemas.microsoft.com/2003/10/Serialization/\x9a\x05\x18{"PUID":"9279565","Title":"Risk Manager","Description":"<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,"}\x01'

The substring {"PUID":"9279565","Title":"Risk Manager","Description":"<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,"} is parsable as JSON, so I came up with the following code to strip the garbage bytes from msg:

>>> x1 = msg.split(b'{"', 1)[1]
>>> x1
b'PUID":"9279565","Title":"Risk Manager","Description":"<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,"}\x01'
>>> x2 = x1[::-1].split(b'}"', 1)[1][::-1]
>>> x2
b'PUID":"9279565","Title":"Risk Manager","Description":"<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,'
>>> final_msg = b'{"%s"}'%x2
>>> final_msg
b'{"PUID":"9279565","Title":"Risk Manager","Description":"<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,"}'
>>> import json
>>> json.loads(final_msg)
{'Description': "<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,'", 'Title': 'Risk Manager', "b'PUID": '9279565'}

It's a bad way of doing what is required. I would like to know a more optimized way of achieving the result. I think a regex could help here, but I have very limited knowledge of regular expressions.

Thanks in advance

There is nothing wrong with what you are doing. You just got a messy response (probably not intended to be consumed as JSON), so you have to deal with messy ways to extract the data you need.
– user1767754, Commented Jul 7, 2017 at 8:41
I already asked about the problem here: stackoverflow.com/questions/44647351/…. We have decided to go for the 3rd case, as using the HTTP protocol has its own limitations.
– Anurag Sharma, Commented Jul 7, 2017 at 8:42

sg.sysel · Accepted Answer · 2017-07-07 08:58:22Z

1

There you go:

import re
# msg is bytes, so the pattern must be bytes too; the braces are escaped to match literally.
final_msg = re.search(rb"\{.*\}", msg).group(0)

answered Jul 7, 2017 at 8:58


1 Comment

sg.sysel Over a year ago

Just be aware that this won't work with a nested dictionary or with multiple JSON objects in one string.
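The limitation mentioned in this comment can be worked around without a regex. The following is a minimal sketch, not part of the original answer: it tries `json.JSONDecoder.raw_decode` at each opening brace and collects every object that parses. The function name `extract_json_objects` and the `sample` bytes are hypothetical, and the sketch assumes the JSON portion of the data is plain ASCII, as it is in the question.

```python
import json

def extract_json_objects(data: bytes):
    """Yield every JSON object that can be decoded from `data`."""
    # latin-1 maps each byte to one code point, so decoding never fails
    # (assumption: the JSON part itself is ASCII, as in the question).
    text = data.decode("latin-1")
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and
            # returns it together with the index where parsing stopped.
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # This brace did not start a valid object; try the next one.
            pos = text.find("{", pos + 1)
            continue
        yield obj
        pos = text.find("{", end)

# Hypothetical input: a nested object and a second object whose string
# value contains a closing brace, separated by junk bytes.
sample = b'@\x06junk\x18{"a": {"b": 1}}\x01\x02{"c": "x}y"}\x01'
objects = list(extract_json_objects(sample))
print(objects)
```

Because the JSON parser itself decides where each object ends, nested dictionaries and braces inside string values do not confuse it, at the cost of a retry whenever a stray `{` appears in the junk.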

Arun Iyer · Accepted Answer · 2017-07-07 17:56:01Z

0

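You can convert the bytes object to a string first. Note that str(msg) would return the repr of the bytes (including the leading b' and doubled backslashes), so decode it instead; latin-1 is used here because msg contains bytes such as \x9a that are not valid UTF-8:

msg = msg.decode('latin-1')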

After that, you can write a generator function that uses enumerate to yield the indices of the symbols you are searching for:

def gen_index(a_string):
    # First yield the index of every '{', then the index of every '}'.
    for i, symbol in enumerate(a_string):
        if symbol == '{':
            yield i
    for j, symbol in enumerate(a_string):
        if symbol == '}':
            yield j

import json

a = list(gen_index(msg))  # all '{' indices, followed by all '}' indices
# Slice from the first occurrence of '{' to the last occurrence of '}' and parse it as JSON.
json_output = json.loads(msg[a[0]:a[-1] + 1])
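The same "first `{` to last `}`" idea can be sketched more compactly with `bytes.find` and `bytes.rfind`, which avoids scanning the string twice in Python code. This is an alternative illustration rather than the answer's own method; the helper name `extract_outer_json` and the shortened `sample` are hypothetical. It slices before decoding, so junk bytes outside the braces never reach the decoder, and it assumes the sliced part is valid UTF-8.

```python
import json

def extract_outer_json(data: bytes):
    """Parse the span from the first '{' to the last '}' in `data`."""
    start = data.find(b"{")   # index of the first opening brace, or -1
    end = data.rfind(b"}")    # index of the last closing brace, or -1
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    # Assumption: the bytes between the braces are valid UTF-8.
    return json.loads(data[start:end + 1].decode("utf-8"))

# Shortened, hypothetical version of the message from the question.
sample = b'@\x06string\x9a\x05\x18{"PUID":"9279565","Title":"Risk Manager"}\x01'
result = extract_outer_json(sample)
print(result["Title"])
```

Like the generator version, this handles a nested dictionary but not several separate JSON objects in one message, since everything between the outermost braces is parsed as a single document.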

edited Jul 7, 2017 at 17:56

answered Jul 7, 2017 at 17:50


1 Comment

Arun Iyer Over a year ago

Hopefully this takes care of the edge case where there is a dictionary nested inside the JSON. It might work.
