I have written a fair amount of code to take an HTML item from a page using Beautiful Soup and translate it into JSON. However, I still have one issue: when I open the final JSON files, they all have a backslash before the quotation marks. I believe this is because I had to convert the HTML to a string and then use str.replace to do all the formatting. I am looking for a short, simple piece of code to add that will remove the backslashes from the final result.
Here is my code.
Note: The HTML file was saved under the authorID with an .html extension, so GVcmmoEAAAAJ.html
from bs4 import BeautifulSoup
import json
import os
authorID = "GVcmmoEAAAAJ"
cur_dir = os.getcwd()
new_dir = authorID
path = os.path.join(cur_dir,new_dir)
if not os.path.exists(path):
    os.mkdir(path)
html_file2 = open((authorID + ".html"), "rb")
soup = BeautifulSoup(html_file2.read(), 'lxml')
gs_results = soup.find_all('tr', class_= 'gsc_a_tr')
gs_strings = []
for i in gs_results:
    item = i
    gs_strings.append(str(item))
gs_data = []
for x in range(0, len(gs_strings)):
    round1 = gs_strings[x].replace("<tr class=\"gsc_a_tr\"><td class=\"gsc_a_t\"><a class=\"gsc_a_at\" data-href=\"", "IDHASH = {\"DirectURL\":\"https://scholar.google.com")
    round2 = round1.replace("\" href=\"javascript:void(0)\">*", "\"")
    round3 = round2.replace("\" href=\"javascript:void(0)\">", "\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"")
    round4 = round3.replace("</a><div class=\"gs_gray\">", "\", \"Authors\":\"")
    round5 = round4.replace("</div><div class=\"gs_gray\">", "\", \"Source\":\"")
    round6 = round5.replace("</div></td><td class=\"gsc_a_c\"><a class=\"gsc_a_ac gs_ibl\" href=\"", "\", \"CitedBy\":\"")
    round7 = round6.replace("<span class=\"gs_oph\">, ", "\", \"SourceYear\":\"")
    round8 = round7.replace("</span></td></tr>", "\"}")
    round9 = round8.replace("</a></td><td class=\"gsc_a_y\"><span class=\"gsc_a_h gsc_a_hc gs_ibl\">", "\", \"PageYear\":\"")
    round10 = round9.replace("</a><span class=\"gsc_a_m\"><a class=\"gsc_a_am\" data-eid=\"", "\", \"DataID\":\"")
    round11 = round10.replace("</span>", "")
    round12 = round11.replace("<span>", "")
    round13 = round12.replace("\"</a></td><td class=\"gsc_a_y\"><span class=\"gsc_a_h gsc_a_hc gs_ibl", "<span class=\"gsc_a_h gsc_a_hc gs_ibl")
    round14 = round13.replace("<span class=\"gsc_a_h gsc_a_hc gs_ibl\">", "\", \"PageYear\":\"")
    round15 = round14.replace("\">", "\", \"Citations\":\"")
    # str(item) escapes ampersands inside attribute values, so turn them back
    round16 = round15.replace("&amp;", "&")
    gs_data.append(round16)
    tempdata = gs_data[x]
    with open((new_dir + "/" + authorID + "-" + str(x) + ".json"), "w") as new_file:
        json.dump(tempdata,new_file)
    new_file.close()
html_file2.close()
Here is a sample of two of the rows it is opening:
> <tr class="gsc_a_tr"><td class="gsc_a_t"><a class="gsc_a_at"
> data-href="/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C"
> href="javascript:void(0)">Audience response made easy: using personal
> digital assistants as a classroom polling tool</a><div
> class="gs_gray">AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T
> Grappone</div><div class="gs_gray">Journal of the American Medical
> Informatics Association 11 (3), 217-220<span class="gs_oph">,
> 2004</span></div></td><td class="gsc_a_c"><a class="gsc_a_ac gs_ibl"
> href="https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441">75</a></td><td
> class="gsc_a_y"><span class="gsc_a_h gsc_a_hc
> gs_ibl">2004</span></td></tr>
>
> <tr class="gsc_a_tr"><td class="gsc_a_t"><a class="gsc_a_at"
> data-href="/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC"
> href="javascript:void(0)">The UCLA Libraries Affordable Course
> Materials Initiative: Expanding Access, Use, and Affordability of
> Course Materials</a><div class="gs_gray">SE Farb, T Grappone</div><div
> class="gs_gray">Against the Grain 26 (5), 14<span class="gs_oph">,
> 2014</span></div></td><td class="gsc_a_c"><a class="gsc_a_ac gs_ibl"
> href="https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717">1</a></td><td
> class="gsc_a_y"><span class="gsc_a_h gsc_a_hc
> gs_ibl">2014</span></td></tr>
Here is how it looks on the screen:
IDHASH = {"DirectURL":"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C", "PopupURL": "POPUPURLHERE", "Title":"Audience response made easy: using personal digital assistants as a classroom polling tool", "Authors":"AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone", "Source":"Journal of the American Medical Informatics Association 11 (3), 217-220", "SourceYear":"2004", "CitedBy":"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441", "Citations":"75", "PageYear":"2004"}
IDHASH = {"DirectURL":"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC", "PopupURL": "POPUPURLHERE", "Title":"The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials", "Authors":"SE Farb, T Grappone", "Source":"Against the Grain 26 (5), 14", "SourceYear":"2014", "CitedBy":"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717", "Citations":"1", "PageYear":"2014"}
That looks good, but when I open the JSON file, this is what I get:
"IDHASH = {\"DirectURL\":\"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"Audience response made easy: using personal digital assistants as a classroom polling tool\", \"Authors\":\"AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone\", \"Source\":\"Journal of the American Medical Informatics Association 11 (3), 217-220\", \"SourceYear\":\"2004\", \"CitedBy\":\"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441\", \"Citations\":\"75\", \"PageYear\":\"2004\"}"
"IDHASH = {\"DirectURL\":\"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials\", \"Authors\":\"SE Farb, T Grappone\", \"Source\":\"Against the Grain 26 (5), 14\", \"SourceYear\":\"2014\", \"CitedBy\":\"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717\", \"Citations\":\"1\", \"PageYear\":\"2014\"}"
I need to remove the \ marks before the " and just have " throughout.
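As far as I can tell, the backslashes seem to come from json.dump being handed a string rather than from str.replace itself. A minimal sketch of the difference, using only the standard library; the `record` dict and the shortened values are made up for illustration and are not taken from the real page:

```python
import json

# Hypothetical record, shortened from the sample output above.
record = {"Title": "Against the Grain", "Citations": "1"}

# Passing an already-formatted string: json.dumps treats it as a single
# JSON string value, so it wraps it in quotes and escapes the inner quotes.
as_text = json.dumps('{"Title": "Against the Grain", "Citations": "1"}')

# Passing the dict itself: json.dumps writes a JSON object, with no escaping.
as_object = json.dumps(record)

print(as_text)    # the whole thing is one quoted string with \" inside
print(as_object)  # a plain JSON object
```

If that reading is right, stripping the backslashes afterwards would only hide the symptom; the file would still need to be built from a dict (and without the `IDHASH = ` prefix) to be valid JSON.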
I converted the original Beautiful Soup results to strings because I could not figure out any other way to revise this, and I needed to keep the HTML coding in places--so I did not just want the screen display table results.
I did look at some related questions, but the answers seemed to deal with classes, which is not what I am doing. I could not make sense of them.
Okay, I revised the code again, and this is working. I had to completely remove the "SourceYear" and merge it with the "Source" field, but that is fine.
html_file2 = open((authorID + ".html"), "r")
soup = BeautifulSoup(html_file2, 'lxml')
gs_results = soup.find_all('tr', class_= 'gsc_a_tr')
gs_lists = []
x = 0
for i in gs_results:
    item = i
    list_keys = ["DirectURL","Title","Authors","Source","CitedBy","Citations","PageYear"]
    initial_link = i.a['data-href']
    prefaceURL = "https://scholar.google.com"
    gs_lists.append((
        prefaceURL + i.a['data-href'],
        i.a.text,
        i.select_one('.gs_gray').text,
        i.select('.gs_gray')[-1].text,
        i.select_one('.gsc_a_ac')['href'],
        i.select_one('.gsc_a_ac').text,
        i.select_one('.gsc_a_y').text
    ))
    with open((new_dir + "/" + authorID + "-" + str(x) + ".json"), "w") as new_file:
        new_entry = dict(zip(list_keys,gs_lists[x]))
        json.dump(new_entry,new_file)
    new_file.close()
    x = x+1
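If a separate "SourceYear" field is ever wanted again, one possible approach is to split it off the merged "Source" text after the fact. This is only a sketch: `split_source_year` is a hypothetical helper, and it assumes the text always ends in a comma followed by a four-digit year, as in the two sample rows above.

```python
import re

def split_source_year(source_text):
    """Split a merged source string into (source, year).

    Assumes the year is the last comma-separated item and has four digits;
    if that pattern is not found, the text is returned with an empty year.
    """
    cleaned = source_text.strip()
    match = re.match(r"^(.*?),\s*(\d{4})$", cleaned)
    if match is None:
        return cleaned, ""
    return match.group(1), match.group(2)

source, year = split_source_year("Against the Grain 26 (5), 14, 2014")
print(source)  # Against the Grain 26 (5), 14
print(year)    # 2014
```

The two values could then be stored under "Source" and "SourceYear" in the dict before it is passed to json.dump. A source whose final page number happens to be four digits and has no year after it would be misread, so the result is worth spot-checking.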