I have a web-scraped string containing key-value pairs, e.g. firstName:"Quaran", lastName:"McPherson":

st = '{"accountId":405266,"firstName":"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null},{"accountId":375964,"firstName":"Micole","lastName":"Cayton","accountIdentifier":"StudentAthlete","profilePicUrl":"https://opendorsepr.blob.core.windows.net/media/375964/20220622223838_46dbe3fd-a683-436b-84d4-90c84a5af35f.jpg","networkProfileCode":"micole-cayton","hasDeals":true,"activityMin":16,"sports":["Basketball","Women\'s Basketball"],"currentTeams":["Minnesota Golden Gophers"],"previousTeams":["Cal Berkeley Golden Bears"],"facebookReach":0,"twitterReach":1273,"instagramReach":5700,"linkedInReach":null}'

I am trying to extract first_name, last_name and a few other fields from this string as lists, so that I end up with, for example, a first_name list holding every first name in the string.

I tried re.findall('"firstName":'"(.*)\S$",st) to get the text "Quaran", but the result comes back in the following format:

'"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null}

How do I specify in the regex that the match should end at the closing quote of the name?

TIA

3 Answers

1

Your string looks like the contents of a JSON array, and valid JSON is easy to parse in any language. To make the string valid, add '[' at the start and ']' at the end, then parse it with your language's JSON parser. For example:

JavaScript:

JSON.parse('[' + st + ']')

Python:

import json
records = json.loads('[' + st + ']')  # a list of dicts, one per record
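As a minimal sketch of how the parsed result could be turned into the per-field lists the question asks for: the sample string below is a shortened, made-up stand-in for the scraped data (same shape, fewer keys), and the list names are only illustrative.

```python
import json

# Shortened stand-in for the scraped string: comma-separated JSON objects
# with no enclosing brackets (an assumption mirroring the question's data).
st = (
    '{"firstName":"Quaran","lastName":"McPherson","instagramReach":0},'
    '{"firstName":"Micole","lastName":"Cayton","instagramReach":5700}'
)

# Wrap the text in [ ] so it becomes a valid JSON array, then parse it.
records = json.loads("[" + st + "]")

# Build one list per field of interest.
first_names = [record["firstName"] for record in records]
last_names = [record["lastName"] for record in records]
instagram_reach = [record["instagramReach"] for record in records]

print(first_names)
print(last_names)
print(instagram_reach)
```

Because the values are parsed rather than pattern-matched, numbers come back as int and null comes back as None, with no extra handling.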

Regular expression:

If you strictly want to parse with a regular expression, use:

/(?:\"|\')(?<key>[\w\d]+)(?:\"|\')(?:\:\s*)(?:\"|\')?(?<value>[\w\s-]*)(?:\"|\')?/gm
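The pattern above is written in JavaScript/PCRE style. Python's re module spells named groups as (?P<name>...), so a rough Python adaptation might look like the sketch below. The sample string is a shortened, made-up stand-in for the scraped data. Note that the value class [\w\s-]* stops at characters such as ':' or '/', so URL values would be cut short and array values would come back empty; this is only suitable for simple scalar fields.

```python
import re

# Shortened stand-in for the scraped string (an assumption for illustration).
sample = (
    '{"accountId":405266,"firstName":"Quaran","hasDeals":true},'
    '{"accountId":375964,"firstName":"Micole","hasDeals":true}'
)

# Same idea as the pattern above, using Python's (?P<name>...) syntax:
# a quoted key, a colon, then an optionally quoted simple value.
pair_pattern = re.compile(
    r"""["'](?P<key>\w+)["']:\s*["']?(?P<value>[\w\s-]*)["']?"""
)

# Collect (key, value) pairs; every value is returned as a string.
pairs = [(m.group("key"), m.group("value")) for m in pair_pattern.finditer(sample)]

first_names = [value for key, value in pairs if key == "firstName"]
account_ids = [value for key, value in pairs if key == "accountId"]

print(first_names)
print(account_ids)
```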

1

Try this:

(?<="firstName":")[^"\r\n]+

(?<="firstName":") is a lookbehind: it positions the match right after the point where "firstName":" appears in the string.

[^"\r\n]+ then matches one or more characters other than ", \r and \n, so the match cannot run past the closing double quote of the firstName value or across a newline.
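A small sketch of this pattern in Python, using re.findall on a shortened, made-up sample in the same shape as the scraped string. Python's re module only accepts fixed-width lookbehinds, which is fine here because "firstName":" is a literal of fixed length.

```python
import re

# Shortened stand-in for the scraped string (an assumption for illustration).
sample = (
    '{"firstName":"Quaran","lastName":"McPherson"},'
    '{"firstName":"Micole","lastName":"Cayton"}'
)

# Lookbehind anchors just after the key; the negated class stops at the
# closing quote, so no lazy quantifier is needed.
first_names = re.findall(r'(?<="firstName":")[^"\r\n]+', sample)
last_names = re.findall(r'(?<="lastName":")[^"\r\n]+', sample)

print(first_names)
print(last_names)
```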

1

Try this regex: (?<=\"firstName\":\").*?(?=\"). The ? after .* makes the match lazy, so it stops as soon as it reaches the next " character.

I want to add that backtracking can carry a performance penalty, which in pathological patterns can become exponential. An alternative is to use "firstName":"(.*?)" and extract the capturing group, which avoids the lookarounds and keeps the cost linear.

Regex: https://regex101.com/r/uM2l8M/1

Python code

import re
st = '{"accountId":405266,"firstName":"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null},{"accountId":375964,"firstName":"Micole","lastName":"Cayton","accountIdentifier":"StudentAthlete","profilePicUrl":"https://opendorsepr.blob.core.windows.net/media/375964/20220622223838_46dbe3fd-a683-436b-84d4-90c84a5af35f.jpg","networkProfileCode":"micole-cayton","hasDeals":true,"activityMin":16,"sports":["Basketball","Women\'s Basketball"],"currentTeams":["Minnesota Golden Gophers"],"previousTeams":["Cal Berkeley Golden Bears"],"facebookReach":0,"twitterReach":1273,"instagramReach":5700,"linkedInReach":null}'
# Capture the text between the quotes that follow "firstName":
pattern = re.compile(r'"firstName":"(.*?)"')

for match in pattern.finditer(st):
    print(match.group(1))  # prints Quaran, then Micole
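To get lists rather than printed lines, the same capturing-group idea can be wrapped in a small helper. This is only a sketch: extract_string_field is a hypothetical name, the sample string is a shortened stand-in for the scraped data, and the pattern assumes the values contain no escaped double quotes.

```python
import re

def extract_string_field(text, key):
    """Return every quoted value stored under `key`, in order of appearance.

    Hypothetical helper; assumes values contain no escaped double quotes.
    """
    pattern = re.compile('"' + re.escape(key) + '":"(.*?)"')
    # With a single capturing group, findall returns the group contents.
    return pattern.findall(text)

# Shortened stand-in for the scraped string (an assumption for illustration).
sample = (
    '{"firstName":"Quaran","lastName":"McPherson","networkProfileCode":"quaran-mcpherson"},'
    '{"firstName":"Micole","lastName":"Cayton","networkProfileCode":"micole-cayton"}'
)

first_names = extract_string_field(sample, "firstName")
last_names = extract_string_field(sample, "lastName")
profile_codes = extract_string_field(sample, "networkProfileCode")

print(first_names)
print(last_names)
print(profile_codes)
```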

2 Comments

Thanks, this worked! One question: what if the value I am trying to match is a number? For example, instead of 'firstName' I want to extract 'instagramReach', which is stored as a number. I can't capture it with the .*?(?= part as written. Do you have a suggestion? Thanks in advance!
In that case the value does not end with a quotation mark, so use a comma as the end character instead: (?<=\"instagramReach\":).*?(?=,)
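One caveat with ending on a comma: a key that comes last in its object (linkedInReach in the question's data) is followed by } rather than a comma. A possible alternative, sketched below on a shortened, made-up sample, is to match the numeric value itself, allowing for null, and convert afterwards. The variable names are illustrative only.

```python
import re

# Shortened stand-in for the scraped string (an assumption for illustration).
sample = (
    '{"firstName":"Quaran","twitterReach":619,"instagramReach":0,"linkedInReach":null},'
    '{"firstName":"Micole","twitterReach":1273,"instagramReach":5700,"linkedInReach":null}'
)

def extract_int_field(text, key):
    """Return integer values for `key`, with JSON null mapped to None."""
    raw = re.findall('"' + re.escape(key) + r'":(-?\d+|null)', text)
    return [None if value == "null" else int(value) for value in raw]

instagram_reach = extract_int_field(sample, "instagramReach")
linkedin_reach = extract_int_field(sample, "linkedInReach")

print(instagram_reach)
print(linkedin_reach)
```

Matching the digits directly means the pattern does not depend on which character happens to follow the value.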
