1

I am trying to get the output of the print command below into a dictionary (so far without success) so that I can subsequently export it to a CSV file.

How can I get parseddata (output of print below) into a dictionary?

sample input file:

<html>
<body>
<p>{ success:true ,results:3,rows:[{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"17-Feb-2015 14:57",SeqNumber:"126171"}]}</p>
</body>
</html>

my code:

import requests
import re
from bs4 import BeautifulSoup
url = requests.get("http://. . .")
soup = BeautifulSoup(url.text, "lxml")
parseddata = soup.string.split(':[', 1)[1].lstrip(']')
print(parseddata)

the output of print(parseddata) is:

{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"17-Feb-2015 14:57",SeqNumber:"126171"}]}
4
  • but what does parseddata look like?? Commented Oct 7, 2015 at 22:53
  • yurib, i have edited the post to show what parseddata looks like. thanks Commented Oct 7, 2015 at 22:57
  • @zs_python: can you provide a sample input file to process, such that people can run test cases against it. Commented Oct 7, 2015 at 23:00
  • sample input file added in question above, thanks Commented Oct 7, 2015 at 23:53

2 Answers

2

Aside from the stray close brace/bracket at the end, this is valid YAML (I made a mistake in my initial answer and called it valid JSON; JavaScript objects can be declared without quoting the properties, but JSON, the portable format, doesn't allow that; YAML does).

Follow the instructions here to use PyYAML to parse the data. The manual split-ing and lstrip is hurting you and making this harder than it needs to be. Just get the text, then parse with yaml (which is a third party module that must be installed separately):

import requests
import yaml
from bs4 import BeautifulSoup

url = requests.get("http://. . .")
soup = BeautifulSoup(url.text, "lxml")
# Use safe_load over load to avoid opening security holes; YAML can do
# a lot of unsafe things if the input isn't trusted, but handling JS
# object literals can be done safely with safe_load
response_object = yaml.safe_load(soup.string.strip())
data_rows = response_object['rows']

for row in data_rows:
    print(row)  # do stuff with each returned row

You can read more in the PyYAML tutorial.


9 Comments

thanks ShadowRanger, i guess the "stray close brace/bracket at the end" is the problem, how do i get rid of it please?
@zs_python: Anticipated that and added an example before you asked. :-)
Odds are, the original data is valid json, just with the object you're interested in as the sole entry in an array attribute of an object with only one attribute (holding the one element array). You could probably just json.loads the whole thing, then access and assign data_as_dict = whole_thing_as_dict['name_of_singleton_key'][0] and avoid your explicit split-ing and lstrip-ing.
Thanks for helping remove the strays ShadowRanger. The above example throws me an error: JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
I have just posted the sample input file in the question so that it gives a clearer picture of what i am trying to parse
0

This looks like a key-value mapping, with ISIN a key and "INE134E01011" a value. But it is not JSON, because the keys are not quoted, nor is it YAML, because plain scalar keys (i.e. strings without quotes) have to be followed by colon + space (: ).
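Since the unquoted keys are the only thing keeping the page text from being JSON, one alternative that needs no third-party parser is to quote the keys with a regular expression and pass the result to json.loads. This is only a minimal sketch and not the approach used in the rest of this answer: the helper name quote_keys is hypothetical, the sample text is a trimmed copy of the question's input, and it assumes that no string value contains text that looks like `{word:` or `,word:`.

```python
import json
import re

# Hypothetical helper: wrap bare keys such as ISIN: in double quotes.
# Assumes no string value contains text resembling `{word:` or `,word:`.
_BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)\s*:')


def quote_keys(text):
    """Return text with every bare object key wrapped in double quotes."""
    return _BARE_KEY.sub(r'\1"\2":', text)


# Trimmed version of the text inside the <p> element of the question.
page_text = (
    '{ success:true ,results:2,rows:['
    '{ISIN:"INE134E01011",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},'
    '{ISIN:"INE134E01011",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"}]}'
)

response = json.loads(quote_keys(page_text))
rows = response["rows"]  # a list with one dict per filing
print(rows[0]["FilingDate"])  # 14-Aug-2015 15:39
```

The colon inside "15:39" is left alone because the pattern only matches a name that directly follows `{` or `,`.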

If you break the output string in parts ¹:

test_str = (
    '{ISIN:"INE134E01011",Ind:"-",'
    'Audited:"Un-Audited",'
    'Cumulative:"Non-cumulative",'
    'Consolidated:"Non-Consolidated",'
    'FilingDate:"14-Aug-2015 15:39",'
    'SeqNumber:"1001577"},'
    '{ISIN:"INE134E01011",'  # new mapping starts
    'Ind:"-",'
    'Audited:"Un-Audited",'
    'Cumulative:"Non-cumulative",'
    'Consolidated:"Non-Consolidated",'
    'FilingDate:"30-May-2015 14:37",'
    'SeqNumber:"129901"},'
    '{ISIN:"INE134E01011",'    # new mapping starts
    'Ind:"-",'
    'Audited:"Un-Audited",'
    'Cumulative:"Non-cumulative",'
    'Consolidated:"Non-Consolidated",'
    'FilingDate:"17-Feb-2015 14:57",'
    'SeqNumber:"126171"}]}'
)

it tests equal to your output:

test_org = '{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"17-Feb-2015 14:57",SeqNumber:"126171"}]}'
assert test_str == test_org

That split-up makes it clear that there are actually 3 mappings and that there is a trailing ] and }. The ] indicates that there is a list, which is consistent with having the 3 mappings separated by commas. The matching [ went missing because split(':[', 1) consumes it along with the colon; the lstrip(']') that follows removes nothing, since the remaining string starts with {.
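A small sketch of what happens to the brackets, using a trimmed stand-in for the page text (only the SeqNumber fields are kept). The variable names and the partition/rindex alternative are illustrative assumptions, not something from the question:

```python
# Trimmed stand-in for the text of the <p> element in the question.
text = '{ success:true ,results:2,rows:[{SeqNumber:"1001577"},{SeqNumber:"129901"}]}'

# The question's approach: splitting on ':[' throws the opening [ away.
tail = text.split(':[', 1)[1]
print(tail)                      # starts with { and ends with ]}
print(tail.lstrip(']') == tail)  # True: there is no leading ] to strip

# Alternative: cut after the key name so the list keeps both brackets.
rest = text.partition('rows:')[2]
rows_text = rest[:rest.rindex(']') + 1]
print(rows_text)
```

Cutting at the last ] instead of calling rstrip('}') avoids accidentally removing more than one closing brace.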

You can easily manipulate the string so YAML can parse it, but the result is a list ²:

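from ruamel.yaml import YAML

# re-add the '[', put a space after each key's colon, drop the trailing '}'
test_str = '[' + test_str.replace(':"', ': "').rstrip('}')

# YAML(typ='safe').load() replaces the older ruamel.yaml.load() function
data = YAML(typ='safe').load(test_str)
print(type(data))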

prints:

<class 'list'>

And since the dicts of which this list consists have keys in common you cannot just combine those without losing information.
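To see the loss of information concretely, here is a minimal sketch that merges two of the rows (reduced to two keys each for brevity) into a single dict. The names rows and merged are only illustrative:

```python
# Two of the parsed rows, reduced to two keys each.
rows = [
    {"ISIN": "INE134E01011", "SeqNumber": "1001577"},
    {"ISIN": "INE134E01011", "SeqNumber": "129901"},
]

merged = {}
for row in rows:
    merged.update(row)  # later rows overwrite keys set by earlier rows

print(merged)  # {'ISIN': 'INE134E01011', 'SeqNumber': '129901'}
```

Only the values of the last row survive, which is why the list has to be kept or re-keyed instead.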

You can either map this list to some key (the colon in your split and the trailing } in the output indicate that the original data does exactly that) or you can take a key with unique values (SeqNumber) and promote the value to a key in a dict replacing the list:

ddata = {}
for elem in data:
    k = elem.pop('SeqNumber')
    ddata[k] = elem

but I don't see a reason to go from a list to a dict if your final goal is a CSV file. If you take the output from the YAML parser you can do:

import csv
with open('output.csv', 'w', newline='') as fp:
    csvwriter = csv.writer(fp)
    csvwriter.writerow(data[0].keys())  # header of common dict keys
    for elem in data:
        csvwriter.writerow(elem.values())  # values

to get a CSV file with the following content (there is no SeqNumber column because the loop above already popped that key from every dict):

ISIN,Ind,Consolidated,Cumulative,Audited,FilingDate
INE134E01011,-,Non-Consolidated,Non-cumulative,Un-Audited,14-Aug-2015 15:39
INE134E01011,-,Non-Consolidated,Non-cumulative,Un-Audited,30-May-2015 14:37
INE134E01011,-,Non-Consolidated,Non-cumulative,Un-Audited,17-Feb-2015 14:57
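If the column order matters, or if SeqNumber should stay in the file, csv.DictWriter with an explicit field list is one option. The following is a sketch under assumptions: the rows are shortened copies of the parsed data, the fieldnames order is an arbitrary choice, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Shortened copies of the parsed rows; the real dicts have more keys.
rows = [
    {"ISIN": "INE134E01011", "FilingDate": "14-Aug-2015 15:39", "SeqNumber": "1001577"},
    {"ISIN": "INE134E01011", "FilingDate": "30-May-2015 14:37", "SeqNumber": "129901"},
]

# Assumed column order; DictWriter writes every row in exactly this order.
fieldnames = ["SeqNumber", "ISIN", "FilingDate"]

buffer = io.StringIO(newline="")  # stands in for open('output.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Because the columns come from fieldnames rather than from the key order of the first dict, the header and the values cannot get out of step.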

¹ Instead of escaping the newlines with \, I use parentheses to turn the multi-line definition into one string; that allows me to put comments on the lines more easily.
² Instead of re-adding the '[', you should of course not remove it in the first place.

3 Comments

thanks Anthon. that was perfect, just did the work for me precisely ! really appreciate all the efforts you took to explain it to me too. Thanks @ShadowRanger, your efforts have added to my python learning and were really helpful too. This noob is overwhelmed by the efforts you guys put in to help me learn. Thank you, onwards !
@zs_python If this solves your issue, please consider accepting the answer (by clicking the marker next to top of this answer). That indicates to others that your problem has been solved (they might not read all the way down to your comment), and marks it as such in the database.
thanks @anthon for the hand holding, have accepted the answer as guided. see you guys around soon :)
