
I am currently trying to extract some data from an event webpage, following a tutorial as I've never done this or used Python for this before. It involves extracting the name, date and location of listed events. It seems to either be extracting or outputting the data twice, but I can't see any line of code that would be doing it. Any help would be appreciated!

from time import sleep
from time import time
from random import randint
from bs4 import BeautifulSoup
from requests import get
import pandas
from warnings import warn

#loop through individual webpages
pages = [str(i) for i in range(1,3)]

url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=' + str(pages)

name = []
date = []
location = []

start_time = time()
requests = 0

for page in pages:

    response = get(url)

    sleep(randint(1,3))

    requests += 1
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))

    if response.status_code != 200:
        warn('Request: {}; Status Code: {}'.format(requests, response.status_code))

    html_soup = BeautifulSoup(response.text, 'html.parser')

    #main div
    event_containers = html_soup.find_all('div', class_ = 'eds-media-card-content__content__principal')

    for container in event_containers:

        #get event name
        event_name = container.h3.div.div.text
        name.append(event_name)

        #get event day & date
        event_date = container.div.div.text
        date.append(event_date)

        #get event location
        event_location = container.find('div', class_ = 'card-text--truncated__one')
        location.append(event_location)

event_list = pandas.DataFrame({
    'event': name,
    'date': date,
    'location': location
})
print(event_list)

[Screenshot: pandas DataFrame output]

1 Answer


Nope, there's nothing in your code that's making the duplicates, but the HTML source does contain each event twice (I don't know why). You can simply remove the duplicate rows.
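A minimal sketch of that dedup step, using made-up event data rather than anything from the scraped page:

```python
import pandas

# Simulate the scrape returning each event twice, as the page's HTML does
name = ['Tech Meetup', 'Art Fair', 'Tech Meetup', 'Art Fair']
date = ['Sat, Jan 4', 'Sun, Jan 5', 'Sat, Jan 4', 'Sun, Jan 5']
location = ['Kuala Lumpur', 'Petaling Jaya', 'Kuala Lumpur', 'Petaling Jaya']

event_list = pandas.DataFrame({'event': name, 'date': date, 'location': location})

# drop_duplicates removes rows whose values match an earlier row exactly
deduped = event_list.drop_duplicates()
print(len(event_list), len(deduped))  # 4 2
```

Note that `drop_duplicates()` compares whole rows, so it only works if every column holds plain, hashable values (strings here), which is also why the location column needs to store text rather than BeautifulSoup Tag objects.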

There is another issue, though: you aren't actually looping through each page. You need to build the url inside your for loop to do that:

from time import sleep
from time import time
from random import randint
from bs4 import BeautifulSoup
from requests import get
import pandas

#loop through individual webpages
pages = [str(i) for i in range(1,3)]


name = []
date = []
location = []

start_time = time()
requests = 0

for page in pages:

    url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=' + str(page)
    response = get(url)

    sleep(randint(1,3))

    requests += 1
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))

    if response.status_code != 200:
        print('Request: {}; Status Code: {}'.format(requests, response.status_code))

    html_soup = BeautifulSoup(response.text, 'html.parser')

    #main div
    event_containers = html_soup.find_all('div', class_ = 'eds-media-card-content__content__principal')

    for container in event_containers:

        #get event name
        event_name = container.h3.div.div.text
        name.append(event_name)

        #get event day & date
        event_date = container.div.div.text
        date.append(event_date)

        #get event location text (guard against a missing div)
        location_div = container.find('div', class_ = 'card-text--truncated__one')
        location.append(location_div.text if location_div else '')

event_list = pandas.DataFrame({
    'event': name,
    'date': date,
    'location': location})

event_list = event_list.drop_duplicates()    
print(event_list)

3 Comments

Hi there, good day dear Chittown 88. I have installed pandas and ran the code in Atom on MX Linux 19.1, and I get back the following error: `Traceback (most recent call last): File "/home/martin/.atom/python/examples/bs_eventbrite_com.py", line 6, in <module> import pandas ImportError: No module named pandas [Finished in 6.144s]`
Ah, a silly mistake on my part. But how would I go about removing the duplicate rows then?
@ShawnTheMaroon47, event_list = event_list.drop_duplicates()
