
Problem: I want to remove all the duplicates from my JSON file, since I'm displaying its content on a personal website. I've provided the JavaScript below showing how I access and display the data from the JSON file. I also have a function called drop_duplicates() that's supposed to remove duplicates, but it's not actually removing them.

Also, I know it's much easier to use an API for all of this; I'm just doing it for fun, to understand JSON and web scraping. I'll also be hosting this site and adding it to my portfolio, so if you have any tips regarding that as well, I'd appreciate them. I'll be making a similar one with an API in the future.


This is a snippet of my Python script that does the web scraping:

    # imports used by this snippet
    import json
    import os
    from datetime import datetime

    from apscheduler.schedulers.blocking import BlockingScheduler
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait


# grabs all the trending quotes for that day
def getTrendingQuotes(browser):
    # wait until the trending links appear (not strictly needed; shown as an example)
    all_trendingQuotes = WebDriverWait(browser, 10).until(
        lambda d: d.find_elements_by_css_selector('#trendingQuotes a')
    )
    return [link.get_attribute('href') for link in all_trendingQuotes]


# def drop_duplicates(arr):
#     """ Appends the item to the returned array only if not
#         already present in our dummy array that serves as reference.
#     """
#     selected = []
#     urls = []
#     for item in arr:
#         if item['url'] not in urls:
#             selected.append(item)
#             urls.append(item['url'])

#     print("\n")
#     print(urls)
#     print("\n")
#     print(selected)
#     print("\n")

#     return selected
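For what it's worth, the drop_duplicates() logic itself is sound: run against sample data (made-up quotes, for illustration only), it does filter repeated urls — which suggests the real problem is that the function is commented out and never called anywhere:

```python
def drop_duplicates(arr):
    """Keep only the first quote seen for each url."""
    selected = []
    urls = []
    for item in arr:
        if item['url'] not in urls:
            selected.append(item)
            urls.append(item['url'])
    return selected


# made-up sample data, mirroring the shape of the scraped quotes
sample = [
    {"url": "https://example.com/quote?qm_symbol=ACB", "Name": "Aurora Cannabis Inc."},
    {"url": "https://example.com/quote?qm_symbol=HNL", "Name": "Horizon North Logistics Inc."},
    {"url": "https://example.com/quote?qm_symbol=ACB", "Name": "Aurora Cannabis Inc."},
]

deduped = drop_duplicates(sample)
print(len(deduped))  # → 2
```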


def getStockDetails(url, browser):

    print(url)
    browser.get(url)

    quote_wrapper = browser.find_element_by_css_selector('div.quote-wrapper')
    quote_name = quote_wrapper.find_element_by_class_name(
        "quote-name").find_element_by_tag_name('h2').text
    quote_price = quote_wrapper.find_element_by_class_name("quote-price").text
    quote_volume = quote_wrapper.find_element_by_class_name(
        "quote-volume").text

    print("\n")
    print("Quote Name: " + quote_name)
    print("Quote Price: " + quote_price)
    print("Quote Volume: " + quote_volume)
    print("\n")

    convertToJson(quote_name, quote_price, quote_volume, url)


quotesArr = []
# Convert to a JSON file


def convertToJson(quote_name, quote_price, quote_volume, url):

    quoteObject = {
        "url": url,
        "Name": quote_name,
        "Price": quote_price,
        "Volume": quote_volume
    }
    quotesArr.append(quoteObject)


def trendingBot(url, browser):
    global quotesArr  # quotesArr is reassigned below, so it must be declared global
    browser.get(url)
    trending = getTrendingQuotes(browser)
    for trend in trending:
        getStockDetails(trend, browser)

    # requests finished: drop duplicates by url, then write json to file
    quotesArr_dict = {quote['url']: quote for quote in quotesArr}
    quotesArr = list(quotesArr_dict.values())
    print("\n")
    print("\n")
    print("COMPLETED!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
    print(quotesArr)
    print("\n")
    print("\n")
    with open('trendingQuoteData.json', 'w') as outfile:
        json.dump(quotesArr, outfile)


def Main():
    scheduler = BlockingScheduler()
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    # applicable to windows os only
    chrome_options.add_argument('--disable-gpu')

    url = 'https://www.tmxmoney.com/en/index.html'
    browser = webdriver.Chrome(
        chrome_options=chrome_options)

    browser.get(url)

    os.system('cls')
    print("[+] Success! Bot Starting!")
    scheduler.add_job(trendingBot, 'interval', minutes=1,
                      next_run_time=datetime.now(), args=[url, browser])
    scheduler.start()
    # trendingBot(url, browser)
    browser.quit()


if __name__ == "__main__":
    Main()

This is a snippet of my JSON file:

[
  {
    "url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN",
    "Volume": "Volume:\n12,915,903",
    "Price": "$ 7.67",
    "Name": "Aurora Cannabis Inc."
  },

  {
    "url": "https://web.tmxmoney.com/quote.php?qm_symbol=HNL&locale=EN",
    "Volume": "Volume:\n548,038",
    "Price": "$ 1.60",
    "Name": "Horizon North Logistics Inc."
  },
  {
    "url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN",
    "Volume": "Volume:\n12,915,903",
    "Price": "$ 7.67",
    "Name": "Aurora Cannabis Inc."
  }
]
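Loading that data and counting urls confirms the duplication. This is a standard-library sketch; the sample below inlines the snippet above (trimmed to the url field) rather than reading trendingQuoteData.json from disk:

```python
import json
from collections import Counter

# Inline copy of the JSON snippet above; in the real script you would
# json.load() the trendingQuoteData.json file instead.
data = json.loads("""[
  {"url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN"},
  {"url": "https://web.tmxmoney.com/quote.php?qm_symbol=HNL&locale=EN"},
  {"url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN"}
]""")

counts = Counter(item['url'] for item in data)
duplicates = {url: n for url, n in counts.items() if n > 1}
print(duplicates)  # the ACB url appears twice
```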

This is the JavaScript that's in my HTML page:

 var xhttp = new XMLHttpRequest();
    xhttp.onreadystatechange = function() {
      if (this.readyState == 4 && this.status == 200) {
        // Typical action to be performed when the document is ready:
        var response = JSON.parse(xhttp.responseText);
        var output = "";
        for (var i = 0; i < response.length; i++) {
          output += "<li>" + response[i].Name + ": " + response[i].Price + "</li>";
        }
        document.getElementById("quotes").innerHTML = output;
      }
    };
    xhttp.open("GET", "trendingQuoteData.json", true);
    xhttp.send();

  • It's called scraping. "scrapping" means throwing something away. Commented Dec 12, 2018 at 22:25
  • @mkrieger1 you do realize that has nothing to do with what I asked right? All you got out of reading that is that I used the wrong verb? It's a simple typo sir. Commented Dec 12, 2018 at 22:29
  • convertToJson uses drop_duplicates, but I don't see convertToJson being called anywhere. Maybe drop_duplicates is never called at all? Commented Dec 12, 2018 at 22:34
  • convertToJson(quote_name, quote_price, quote_volume, url) just takes in the following arguments. Commented Dec 12, 2018 at 22:38
  • It does get called. It's just a snippet of the code. Commented Dec 12, 2018 at 22:47

1 Answer


Before you dump quotesArr to the JSON file, do this:

quotesArr_dict = {quote['url']: quote for quote in quotesArr}
quotesArr = list(quotesArr_dict.values())

These two lines should remove all the duplicates in quotesArr.

def trendingBot(url, browser):
    global quotesArr  # quotesArr is reassigned below, so it must be declared global
    browser.get(url)
    trending = getTrendingQuotes(browser)
    for trend in trending:
        getStockDetails(trend, browser)
    quotesArr_dict = {quote['url']: quote for quote in quotesArr}
    quotesArr = list(quotesArr_dict.values())
    # requests finished, write json to file
    with open('trendingQuoteData.json', 'w') as outfile:
        json.dump(quotesArr, outfile)
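A minimal demonstration (toy data) of why the dict comprehension removes duplicates: dict keys are unique, so when two quotes share a url the later one overwrites the earlier one, and .values() then yields exactly one quote per url.

```python
quotesArr = [
    {"url": "a", "Price": "$1"},
    {"url": "b", "Price": "$2"},
    {"url": "a", "Price": "$3"},  # duplicate url; this later entry wins
]

# Keyed by url, later assignments overwrite earlier ones.
quotesArr_dict = {quote['url']: quote for quote in quotesArr}
quotesArr = list(quotesArr_dict.values())

print(quotesArr)
# → [{'url': 'a', 'Price': '$3'}, {'url': 'b', 'Price': '$2'}]
```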

8 Comments

So don't even bother with drop_duplicated() then?
Why a function if you can solve it in place with 2 lines of code.
Yes, sorry about that.
I used attr for attributes too many times recently and made that typo.
No problem. So just to clarify everything: this should be placed in my convertToJson() after quotesArr.append(quoteObject): quotesArr_dict = {quote['url']: quote for quote in quotesArr} and quotesArr = list(quotesArr_dict.values()). For some reason I get an error that says "Using variable 'quotesArr' before assignment".
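Regarding that last error: assigning to quotesArr anywhere inside a function makes the name local to that function, so the read on the right-hand side fails before the new value is bound. Declaring the name global fixes it. A minimal sketch of both behaviours:

```python
quotesArr = [{"url": "a"}, {"url": "a"}]


def dedupe_without_global():
    # Assigning to quotesArr makes it local here, so reading it on the
    # right-hand side raises UnboundLocalError before the assignment runs.
    quotesArr = list({q['url']: q for q in quotesArr}.values())


def dedupe_with_global():
    global quotesArr  # refer to the module-level list instead
    quotesArr = list({q['url']: q for q in quotesArr}.values())


try:
    dedupe_without_global()
except UnboundLocalError as err:
    print("failed as expected:", err)

dedupe_with_global()
print(quotesArr)  # → [{'url': 'a'}]
```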
