
Problem: I want to remove all the duplicates from my JSON file, since I'm displaying its content on a personal website. I've provided the JavaScript below showing how I access and display the data from the JSON file. I also have a function called drop_duplicates() that's supposed to remove duplicates, but it's not actually removing them.

Also, I know it's much easier to use an API for all of this; I'm just doing it for fun, to understand JSON and web scraping. I'll also be hosting this site and adding it to my portfolio, so if you have any tips regarding that as well, I'd appreciate them. I'll be making a similar one with an API in the future.


This is a snippet of my Python script that does the web scraping:

    # imports used by this snippet
    import json
    import os
    from datetime import datetime

    from apscheduler.schedulers.blocking import BlockingScheduler
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait


# grabs all the trending quotes for that day
def getTrendingQuotes(browser):
    # wait until the trending links appear (not strictly needed; shown as an example)
    all_trendingQuotes = WebDriverWait(browser, 10).until(
        lambda d: d.find_elements_by_css_selector('#trendingQuotes a')
    )
    return [link.get_attribute('href') for link in all_trendingQuotes]


# def drop_duplicates(arr):
#     """ Appends the item to the returned array only if not
#         already present in our dummy array that serves as reference.
#     """
#     selected = []
#     urls = []
#     for item in arr:
#         if item['url'] not in urls:
#             selected.append(item)
#             urls.append(item['url'])

#     print("\n")
#     print(urls)
#     print("\n")
#     print(selected)
#     print("\n")

#     return selected
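For what it's worth, the drop_duplicates() logic itself is sound: run against sample data (made-up quotes, for illustration only), it does filter repeated urls — which suggests the real problem is that the function is commented out and never called anywhere:

```python
def drop_duplicates(arr):
    """Keep only the first quote seen for each url."""
    selected = []
    urls = []
    for item in arr:
        if item['url'] not in urls:
            selected.append(item)
            urls.append(item['url'])
    return selected


# made-up sample data, mirroring the shape of the scraped quotes
sample = [
    {"url": "https://example.com/quote?qm_symbol=ACB", "Name": "Aurora Cannabis Inc."},
    {"url": "https://example.com/quote?qm_symbol=HNL", "Name": "Horizon North Logistics Inc."},
    {"url": "https://example.com/quote?qm_symbol=ACB", "Name": "Aurora Cannabis Inc."},
]

deduped = drop_duplicates(sample)
print(len(deduped))  # → 2
```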


def getStockDetails(url, browser):

    print(url)
    browser.get(url)

    quote_wrapper = browser.find_element_by_css_selector('div.quote-wrapper')
    quote_name = quote_wrapper.find_element_by_class_name(
        "quote-name").find_element_by_tag_name('h2').text
    quote_price = quote_wrapper.find_element_by_class_name("quote-price").text
    quote_volume = quote_wrapper.find_element_by_class_name(
        "quote-volume").text

    print("\n")
    print("Quote Name: " + quote_name)
    print("Quote Price: " + quote_price)
    print("Quote Volume: " + quote_volume)
    print("\n")

    convertToJson(quote_name, quote_price, quote_volume, url)


quotesArr = []
# Convert to a JSON file


def convertToJson(quote_name, quote_price, quote_volume, url):

    quoteObject = {
        "url": url,
        "Name": quote_name,
        "Price": quote_price,
        "Volume": quote_volume
    }
    quotesArr.append(quoteObject)


def trendingBot(url, browser):
    global quotesArr  # quotesArr is reassigned below, so it must be declared global
    browser.get(url)
    trending = getTrendingQuotes(browser)
    for trend in trending:
        getStockDetails(trend, browser)

    # requests finished: drop duplicates by url, then write json to file
    quotesArr_dict = {quote['url']: quote for quote in quotesArr}
    quotesArr = list(quotesArr_dict.values())
    print("\n")
    print("\n")
    print("COMPLETED!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
    print(quotesArr)
    print("\n")
    print("\n")
    with open('trendingQuoteData.json', 'w') as outfile:
        json.dump(quotesArr, outfile)


def Main():
    scheduler = BlockingScheduler()
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    # applicable to windows os only
    chrome_options.add_argument('--disable-gpu')

    url = 'https://www.tmxmoney.com/en/index.html'
    browser = webdriver.Chrome(
        chrome_options=chrome_options)

    browser.get(url)

    os.system('cls')
    print("[+] Success! Bot Starting!")
    scheduler.add_job(trendingBot, 'interval', minutes=1,
                      next_run_time=datetime.now(), args=[url, browser])
    scheduler.start()
    # trendingBot(url, browser)
    browser.quit()


if __name__ == "__main__":
    Main()

This is a snippet of my JSON file:

[
  {
    "url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN",
    "Volume": "Volume:\n12,915,903",
    "Price": "$ 7.67",
    "Name": "Aurora Cannabis Inc."
  },

  {
    "url": "https://web.tmxmoney.com/quote.php?qm_symbol=HNL&locale=EN",
    "Volume": "Volume:\n548,038",
    "Price": "$ 1.60",
    "Name": "Horizon North Logistics Inc."
  },
  {
    "url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN",
    "Volume": "Volume:\n12,915,903",
    "Price": "$ 7.67",
    "Name": "Aurora Cannabis Inc."
  }
]
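Loading that data and counting urls confirms the duplication. This is a standard-library sketch; the sample below inlines the snippet above (trimmed to the url field) rather than reading trendingQuoteData.json from disk:

```python
import json
from collections import Counter

# Inline copy of the JSON snippet above; in the real script you would
# json.load() the trendingQuoteData.json file instead.
data = json.loads("""[
  {"url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN"},
  {"url": "https://web.tmxmoney.com/quote.php?qm_symbol=HNL&locale=EN"},
  {"url": "https://web.tmxmoney.com/quote.php?qm_symbol=ACB&locale=EN"}
]""")

counts = Counter(item['url'] for item in data)
duplicates = {url: n for url, n in counts.items() if n > 1}
print(duplicates)  # the ACB url appears twice
```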

This is the JavaScript that's in my HTML page:

 var xhttp = new XMLHttpRequest();
    xhttp.onreadystatechange = function() {
      if (this.readyState == 4 && this.status == 200) {
        // Typical action to be performed when the document is ready:
        var response = JSON.parse(xhttp.responseText);
        var output = "";
        for (var i = 0; i < response.length; i++) {
          output += "<li>" + response[i].Name + ": " + response[i].Price + "</li>";
        }
        document.getElementById("quotes").innerHTML = output;
      }
    };
    xhttp.open("GET", "trendingQuoteData.json", true);
    xhttp.send();

  • It's called scraping. "scrapping" means throwing something away. Commented Dec 12, 2018 at 22:25
  • @mkrieger1 you do realize that has nothing to do with what I asked right? All you got out of reading that is that I used the wrong verb? It's a simple typo sir. Commented Dec 12, 2018 at 22:29
  • convertToJson uses drop_duplicates, but I don't see convertToJson being called anywhere. Maybe drop_duplicates is never called at all? Commented Dec 12, 2018 at 22:34
  • convertToJson(quote_name, quote_price, quote_volume, url) just takes in the following arguments. Commented Dec 12, 2018 at 22:38
  • It does get called. It's just a snippet of the code. Commented Dec 12, 2018 at 22:47

1 Answer


Before you dump quotesArr to the JSON file, do this:

quotesArr_dict = {quote['url']: quote for quote in quotesArr}
quotesArr = list(quotesArr_dict.values())

These two lines should remove all the duplicates in quotesArr.

def trendingBot(url, browser):
    global quotesArr  # quotesArr is reassigned below, so it must be declared global
    browser.get(url)
    trending = getTrendingQuotes(browser)
    for trend in trending:
        getStockDetails(trend, browser)
    quotesArr_dict = {quote['url']: quote for quote in quotesArr}
    quotesArr = list(quotesArr_dict.values())
    # requests finished, write json to file
    with open('trendingQuoteData.json', 'w') as outfile:
        json.dump(quotesArr, outfile)
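A minimal demonstration (toy data) of why the dict comprehension removes duplicates: dict keys are unique, so when two quotes share a url the later one overwrites the earlier one, and .values() then yields exactly one quote per url.

```python
quotesArr = [
    {"url": "a", "Price": "$1"},
    {"url": "b", "Price": "$2"},
    {"url": "a", "Price": "$3"},  # duplicate url; this later entry wins
]

# Keyed by url, later assignments overwrite earlier ones.
quotesArr_dict = {quote['url']: quote for quote in quotesArr}
quotesArr = list(quotesArr_dict.values())

print(quotesArr)
# → [{'url': 'a', 'Price': '$3'}, {'url': 'b', 'Price': '$2'}]
```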

8 Comments

So don't even bother with drop_duplicated() then?
Why a function if you can solve it in place with 2 lines of code.
Yes, sorry about that.
I used attr for attributes too many times recently and made that typo.
No problem. So just to clarify everything: this should be placed in my convertToJson() after quotesArr.append(quoteObject): quotesArr_dict = {quote['url']: quote for quote in quotesArr} and quotesArr = list(quotesArr_dict.values()). For some reason I get an error that says "Using variable 'quotesArr' before assignment".
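Regarding that last error: assigning to quotesArr anywhere inside a function makes the name local to that function, so the read on the right-hand side fails before the new value is bound. Declaring the name global fixes it. A minimal sketch of both behaviours:

```python
quotesArr = [{"url": "a"}, {"url": "a"}]


def dedupe_without_global():
    # Assigning to quotesArr makes it local here, so reading it on the
    # right-hand side raises UnboundLocalError before the assignment runs.
    quotesArr = list({q['url']: q for q in quotesArr}.values())


def dedupe_with_global():
    global quotesArr  # refer to the module-level list instead
    quotesArr = list({q['url']: q for q in quotesArr}.values())


try:
    dedupe_without_global()
except UnboundLocalError as err:
    print("failed as expected:", err)

dedupe_with_global()
print(quotesArr)  # → [{'url': 'a'}]
```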
