I have done a lot of coding to take an HTML item from a page using Beautiful Soup and translate it into JSON. However, I still have one issue: When I open the final JSON files, they all have a backslash before the quotation marks. I know this is because I had to convert the HTML to a string and then use str.replace to do all the formatting. I am looking for a short and simple code to add that will remove the backslashes from the final result.

Here is my code.

Note: The HTML file was saved under the authorID with an .html extension, so GVcmmoEAAAAJ.html

from bs4 import BeautifulSoup
import json
import os

authorID = "GVcmmoEAAAAJ"  

cur_dir = os.getcwd()
new_dir = authorID
path = os.path.join(cur_dir,new_dir)
if not os.path.exists(path):
    os.mkdir(path)

html_file2 = open((authorID + ".html"), "rb")
soup = BeautifulSoup(html_file2.read(), 'lxml')

gs_results = soup.find_all('tr', class_= 'gsc_a_tr')

gs_strings = []
for i in gs_results:
    item = i
    gs_strings.append(str(item))

gs_data = []
for x in range(0, len(gs_strings)):
    round1 = gs_strings[x].replace("<tr class=\"gsc_a_tr\"><td class=\"gsc_a_t\"><a class=\"gsc_a_at\" data-href=\"", "IDHASH = {\"DirectURL\":\"https://scholar.google.com")
    round2 = round1.replace("\" href=\"javascript:void(0)\">*", "\"")
    round3 = round2.replace("\" href=\"javascript:void(0)\">", "\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"")
    round4 = round3.replace("</a><div class=\"gs_gray\">", "\", \"Authors\":\"")
    round5 = round4.replace("</div><div class=\"gs_gray\">", "\", \"Source\":\"")
    round6 = round5.replace("</div></td><td class=\"gsc_a_c\"><a class=\"gsc_a_ac gs_ibl\" href=\"", "\", \"CitedBy\":\"")
    round7 = round6.replace("<span class=\"gs_oph\">, ", "\", \"SourceYear\":\"")
    round8 = round7.replace("</span></td></tr>", "\"}")
    round9 = round8.replace("</a></td><td class=\"gsc_a_y\"><span class=\"gsc_a_h gsc_a_hc gs_ibl\">", "\", \"PageYear\":\"")
    round10 = round9.replace("</a><span class=\"gsc_a_m\"><a class=\"gsc_a_am\" data-eid=\"", "\", \"DataID\":\"")
    round11 = round10.replace("</span>", "")
    round12 = round11.replace("<span>", "")
    round13 = round12.replace("\"</a></td><td class=\"gsc_a_y\"><span class=\"gsc_a_h gsc_a_hc gs_ibl", "<span class=\"gsc_a_h gsc_a_hc gs_ibl")
    round14 = round13.replace("<span class=\"gsc_a_h gsc_a_hc gs_ibl\">", "\", \"PageYear\":\"")
    round15 = round14.replace("\">", "\", \"Citations\":\"")
    round16 = round15.replace("&amp;", "&")
    
    gs_data.append(round16)
    tempdata = gs_data[x]
    
    with open((new_dir + "/" + authorID + "-" + str(x) + ".json"), "w") as new_file:
        json.dump(tempdata,new_file) 
        
    
    new_file.close()
    
html_file2.close()

Here is a sample of two rows from the file it is opening:

> <tr class="gsc_a_tr"><td class="gsc_a_t"><a class="gsc_a_at"
> data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=GVcmmoEAAAAJ&amp;citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C"
> href="javascript:void(0)">Audience response made easy: using personal
> digital assistants as a classroom polling tool</a><div
> class="gs_gray">AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T
> Grappone</div><div class="gs_gray">Journal of the American Medical
> Informatics Association 11 (3), 217-220<span class="gs_oph">,
> 2004</span></div></td><td class="gsc_a_c"><a class="gsc_a_ac gs_ibl"
> href="https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;oe=ASCII&amp;cites=8886823218645962441">75</a></td><td
> class="gsc_a_y"><span class="gsc_a_h gsc_a_hc
> gs_ibl">2004</span></td></tr>
> 
> <tr class="gsc_a_tr"><td class="gsc_a_t"><a class="gsc_a_at"
> data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=GVcmmoEAAAAJ&amp;citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC"
> href="javascript:void(0)">The UCLA Libraries Affordable Course
> Materials Initiative: Expanding Access, Use, and Affordability of
> Course Materials</a><div class="gs_gray">SE Farb, T Grappone</div><div
> class="gs_gray">Against the Grain 26 (5), 14<span class="gs_oph">,
> 2014</span></div></td><td class="gsc_a_c"><a class="gsc_a_ac gs_ibl"
> href="https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;oe=ASCII&amp;cites=3591317356459154717">1</a></td><td
> class="gsc_a_y"><span class="gsc_a_h gsc_a_hc
> gs_ibl">2014</span></td></tr>

Here is how it looks on the screen:

IDHASH = {"DirectURL":"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C", "PopupURL": "POPUPURLHERE", "Title":"Audience response made easy: using personal digital assistants as a classroom polling tool", "Authors":"AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone", "Source":"Journal of the American Medical Informatics Association 11 (3), 217-220", "SourceYear":"2004", "CitedBy":"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441", "Citations":"75", "PageYear":"2004"}

IDHASH = {"DirectURL":"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC", "PopupURL": "POPUPURLHERE", "Title":"The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials", "Authors":"SE Farb, T Grappone", "Source":"Against the Grain 26 (5), 14", "SourceYear":"2014", "CitedBy":"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717", "Citations":"1", "PageYear":"2014"}

That looks good, but when I open the JSON file, this is what I get:

"IDHASH = {\"DirectURL\":\"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"Audience response made easy: using personal digital assistants as a classroom polling tool\", \"Authors\":\"AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone\", \"Source\":\"Journal of the American Medical Informatics Association 11 (3), 217-220\", \"SourceYear\":\"2004\", \"CitedBy\":\"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441\", \"Citations\":\"75\", \"PageYear\":\"2004\"}"

"IDHASH = {\"DirectURL\":\"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials\", \"Authors\":\"SE Farb, T Grappone\", \"Source\":\"Against the Grain 26 (5), 14\", \"SourceYear\":\"2014\", \"CitedBy\":\"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717\", \"Citations\":\"1\", \"PageYear\":\"2014\"}"

I need to remove the \ marks before the " and just have " throughout.

I converted the original Beautiful Soup results to strings because I could not figure out any other way to revise this, and I needed to keep the HTML coding in places--so I did not just want the screen display table results.

I did look at some related questions, but the answers seemed to deal with classes, which is not what I am doing. I could not make sense of them.
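
For context, the backslashes are what `json.dump` produces whenever it is handed a `str`: the whole text becomes a single JSON string, so every inner quotation mark has to be escaped. A minimal sketch of the difference, using a shortened, hypothetical sample rather than the real rows:

```python
import json

# A hand-built string that merely looks like JSON (shortened, hypothetical sample).
as_text = '{"Title": "Against the Grain", "Citations": "1"}'

# The same data as a real Python dict.
as_dict = {"Title": "Against the Grain", "Citations": "1"}

# Dumping a str yields one JSON string, so every inner quote gets escaped.
dumped_text = json.dumps(as_text)

# Dumping a dict yields a JSON object, with nothing to escape here.
dumped_dict = json.dumps(as_dict)

# If the text is already valid JSON, json.loads turns it back into a dict.
recovered = json.loads(as_text)

print(dumped_text)
print(dumped_dict)
```

Note that the strings built in the code above start with `IDHASH = `, which is not valid JSON, so `json.loads` would reject them as they stand; building a dict and dumping that avoids the problem altogether.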


Okay, I revised the code again, and this is working. I had to completely remove the "SourceYear" and merge it with the "Source" field, but that is fine.

# Open the saved page in text mode; the with block closes the file for us.
with open((authorID + ".html"), "r") as html_file2:
    soup = BeautifulSoup(html_file2, 'lxml')

gs_results = soup.find_all('tr', class_='gsc_a_tr')

# These do not change per row, so define them once outside the loop.
list_keys = ["DirectURL","Title","Authors","Source","CitedBy","Citations","PageYear"]
prefaceURL = "https://scholar.google.com"

gs_lists = []
for x, i in enumerate(gs_results):
    gs_lists.append((
        prefaceURL + i.a['data-href'],
        i.a.text,
        i.select_one('.gs_gray').text,
        i.select('.gs_gray')[-1].text,
        i.select_one('.gsc_a_ac')['href'],
        i.select_one('.gsc_a_ac').text,
        i.select_one('.gsc_a_y').text
    ))

    # Dump a dict (not a pre-formatted string) so the quotes are not escaped.
    with open((new_dir + "/" + authorID + "-" + str(x) + ".json"), "w") as new_file:
        new_entry = dict(zip(list_keys, gs_lists[x]))
        json.dump(new_entry, new_file)

2 Comments

  • Please post the complete error; part of an error message is not helpful to us. You can also debug your code with try/except. Commented Jun 17, 2021 at 22:18
  • @αԋɱҽԃαмєяιcαη It says "for x in soup.select('tr.gsc_a_tr') IndexError: list index out of range". It does not say this if I paste the 'html = """ """' as you did, but I cannot do that. It has to grab it from the html file, which means I need to use a 'for x in range()' type of construction. Also, even after installing pprint, it will not allow any form of 'pp' in my code--even copying and pasting exactly what you had. Commented Jun 18, 2021 at 14:31

1 Answer

  1. You've inserted a faulty HTML structure that does not match the original. I cleaned it up on my end to be able to work on it.

     Please copy and paste the HTML code exactly as it appears on the website or in the file; otherwise you make it hard for others to help you.

  2. Please try to learn the library you are using: bs4-Documentation

  3. You really don't need the long chain of rounds where you keep replacing parts of the string to clean it up!

from bs4 import BeautifulSoup
from pprint import pp

html = """<tr class="gsc_a_tr">
    <td class="gsc_a_t"><a class="gsc_a_at" data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=GVcmmoEAAAAJ&amp;citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C" href="javascript:void(0)">Audience response made easy: using personal digital assistants as a classroom polling tool</a>
        <div class="gs_gray">AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone</div>
        <div class="gs_gray">Journal of the American Medical Informatics Association 11 (3), 217-220<span class="gs_oph">,
        2004</span></div>
    </td>
    <td class="gsc_a_c"><a class="gsc_a_ac gs_ibl" href="https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;oe=ASCII&amp;cites=8886823218645962441">75</a></td>
    <td class="gsc_a_y"><span class="gsc_a_h gsc_a_hc gs_ibl">2004</span></td>
</tr>
<tr class="gsc_a_tr">
    <td class="gsc_a_t"><a class="gsc_a_at" data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=GVcmmoEAAAAJ&amp;citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC" href="javascript:void(0)">The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials</a>
        <div class="gs_gray">SE Farb, T Grappone</div>
        <div class="gs_gray">Against the Grain 26 (5), 14<span class="gs_oph">,
        2014</span></div>
    </td>
    <td class="gsc_a_c"><a class="gsc_a_ac gs_ibl" href="https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;oe=ASCII&amp;cites=3591317356459154717">1</a></td>
    <td class="gsc_a_y"><span class="gsc_a_h gsc_a_hc gs_ibl">2014</span></td>
</tr>"""


soup = BeautifulSoup(html, 'lxml')
goal = [
    (
        x.a['data-href'],
        x.a.text,
        x.select_one('.gs_gray').text,
        x.select('.gs_gray')[-1].text.rsplit(',', 1)[0],
        x.select('.gs_gray')[-1].text.rsplit(',', 1)[1].strip(),
        x.select_one('.gsc_a_ac')['href'],
        x.select_one('.gsc_a_ac').text,
        x.select_one('.gsc_a_y').text
    )
    for x in soup.select('tr.gsc_a_tr')
]
pp(goal, indent=2)

Ask yourself why the bs4 parser was created.

Output:

[ ( '/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C',
    'Audience response made easy: using personal digital assistants as a '
    'classroom polling tool',
    'AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone',
    'Journal of the American Medical Informatics Association 11 (3), 217-220',
    '2004',
    'https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441',
    '75',
    '2004'),
  ( '/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC',
    'The UCLA Libraries Affordable Course Materials Initiative: Expanding '
    'Access, Use, and Affordability of Course Materials',
    'SE Farb, T Grappone',
    'Against the Grain 26 (5), 14',
    '2014',
    'https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717',
    '1',
    '2014')]

Now you have a list of tuples! Feel free to assign keys and convert each one to a dict.
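
Assigning keys and writing one JSON file per item could look roughly like the sketch below. The key names come from the question; the `rows_to_json_files` helper, the temporary output directory, and the file prefix are hypothetical choices for illustration, and the two rows are copied from the output above.

```python
import json
import os
import tempfile

# Key names taken from the question's desired output.
keys = ["DirectURL", "Title", "Authors", "Source", "SourceYear",
        "CitedBy", "Citations", "PageYear"]

# Stand-in for the `goal` list of tuples shown in the output above.
goal = [
    ("/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C",
     "Audience response made easy: using personal digital assistants as a classroom polling tool",
     "AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone",
     "Journal of the American Medical Informatics Association 11 (3), 217-220",
     "2004",
     "https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441",
     "75",
     "2004"),
    ("/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC",
     "The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials",
     "SE Farb, T Grappone",
     "Against the Grain 26 (5), 14",
     "2014",
     "https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717",
     "1",
     "2014"),
]


def rows_to_json_files(rows, keys, out_dir, prefix):
    """Write one JSON file per row (hypothetical helper) and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, row in enumerate(rows):
        # Pair each key with the value in the same position of the tuple.
        entry = dict(zip(keys, row))
        path = os.path.join(out_dir, f"{prefix}-{index}.json")
        with open(path, "w", encoding="utf-8") as handle:
            # Dumping a dict writes a JSON object, so the quotes are not escaped.
            json.dump(entry, handle, indent=2)
        paths.append(path)
    return paths


# A temporary directory keeps the sketch self-contained; use your own folder instead.
out_dir = tempfile.mkdtemp()
paths = rows_to_json_files(goal, keys, out_dir, "GVcmmoEAAAAJ")
print(paths)
```

Because `json.dump` receives a dict rather than a pre-formatted string, the files contain plain JSON objects and no string replacement is needed afterwards.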


3 Comments

It is not working. I tried to replicate it, and it says 'IndexError: list index out of range'
I could not format it as comments only allow a very limited number of characters, and I could not see any way to mark them as code. Also, each HTML file will have anywhere from a few to several hundred items, and each item has to be saved as its own JSON file eventually. That is why I had it saving as items in a list. However, your code looks a LOT better than mine ... if I can figure it out.
@mdign002 The comment section is not for discussing code; you should edit your question and include the latest update. For your information, though: why are you reading the file as bytes? rb = read bytes.
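
The full traceback is not shown, so the following is only a guess at the `IndexError` mentioned above: `text.rsplit(',', 1)[1]` fails on any row whose source text contains no comma, which can happen in a full file even though both sample rows have one. A minimal sketch of a more forgiving split, where `split_source_year` is a hypothetical helper and the four-digit check is an assumption about how years appear:

```python
def split_source_year(text):
    """Split 'Journal 11 (3), 217-220, 2004' into (source, year).

    Returns an empty year when the text does not end in a comma
    followed by four digits, instead of raising IndexError.
    """
    source, sep, tail = text.rpartition(",")
    tail = tail.strip()
    # Assumption: a year is exactly four digits after the last comma.
    if sep and len(tail) == 4 and tail.isdigit():
        return source.strip(), tail
    return text.strip(), ""


print(split_source_year("Against the Grain 26 (5), 14, 2014"))
print(split_source_year("Against the Grain 26 (5), 14"))
print(split_source_year("Untitled report"))
```

A four-digit page number with no year after it would be misread as a year here, so reading the `gs_oph` span directly from the parsed row is likely the more robust route when it is present.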
