
I am currently trying to extract some data from an event webpage, following a tutorial as I've never done this or used Python for this before. It involves extracting the name, date and location of listed events. It seems to either be extracting or outputting the data twice, but I can't see any line of code that would be doing it. Any help would be appreciated!

from time import sleep
from time import time
from random import randint
from bs4 import BeautifulSoup
from requests import get
import pandas
from warnings import warn

#loop through individual webpages
pages = [str(i) for i in range(1,3)]

url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=' + str(pages)

name = []
date = []
location = []

start_time = time()
requests = 0

for page in pages:

    response = get(url)

    sleep(randint(1,3))

    requests += 1
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))

    if response.status_code != 200:
        warn('Request: {}; Status Code: {}'.format(requests, response.status_code))

    html_soup = BeautifulSoup(response.text, 'html.parser')

    #main div
    event_containers = html_soup.find_all('div', class_ = 'eds-media-card-content__content__principal')

    for container in event_containers:

        #get event name
        event_name = container.h3.div.div.text
        name.append(event_name)

        #get event day & date
        event_date = container.div.div.text
        date.append(event_date)

        #get event location
        event_location = container.find('div', class_ = 'card-text--truncated__one')
        location.append(event_location)

event_list = pandas.DataFrame({
    'event': name,
    'date': date,
    'location': location
})
print(event_list)

[Screenshot: pandas DataFrame output]

1 Answer


Nope, there's nothing in your code that's making the duplicates, but the HTML source does contain each event twice (I don't know why). You can simply remove the duplicate rows.
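A minimal sketch of that dedup step, using made-up event data rather than anything from the scraped page:

```python
import pandas

# Simulate the scrape returning each event twice, as the page's HTML does
name = ['Tech Meetup', 'Art Fair', 'Tech Meetup', 'Art Fair']
date = ['Sat, Jan 4', 'Sun, Jan 5', 'Sat, Jan 4', 'Sun, Jan 5']
location = ['Kuala Lumpur', 'Petaling Jaya', 'Kuala Lumpur', 'Petaling Jaya']

event_list = pandas.DataFrame({'event': name, 'date': date, 'location': location})

# drop_duplicates removes rows whose values match an earlier row exactly
deduped = event_list.drop_duplicates()
print(len(event_list), len(deduped))  # 4 2
```

Note that `drop_duplicates()` compares whole rows, so it only works if every column holds plain, hashable values (strings here), which is also why the location column needs to store text rather than BeautifulSoup Tag objects.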

There is another issue, though: you aren't actually looping through each page. You need to build the url inside your for loop to do that:

from time import sleep
from time import time
from random import randint
from bs4 import BeautifulSoup
from requests import get
import pandas

#loop through individual webpages
pages = [str(i) for i in range(1,3)]


name = []
date = []
location = []

start_time = time()
requests = 0

for page in pages:

    url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=' + str(page)
    response = get(url)

    sleep(randint(1,3))

    requests += 1
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))

    if response.status_code != 200:
        print('Request: {}; Status Code: {}'.format(requests, response.status_code))

    html_soup = BeautifulSoup(response.text, 'html.parser')

    #main div
    event_containers = html_soup.find_all('div', class_ = 'eds-media-card-content__content__principal')

    for container in event_containers:

        #get event name
        event_name = container.h3.div.div.text
        name.append(event_name)

        #get event day & date
        event_date = container.div.div.text
        date.append(event_date)

        #get event location text (guard against a missing div)
        location_div = container.find('div', class_ = 'card-text--truncated__one')
        location.append(location_div.text if location_div else '')

event_list = pandas.DataFrame({
    'event': name,
    'date': date,
    'location': location})

event_list = event_list.drop_duplicates()    
print(event_list)

3 Comments

Hi there, good day dear Chittown 88. I have installed pandas and ran the code in Atom on MX Linux 19.1, and I get back the following error: `Traceback (most recent call last): File "/home/martin/.atom/python/examples/bs_eventbrite_com.py", line 6, in <module> import pandas ImportError: No module named pandas [Finished in 6.144s]`
Ah, a silly mistake on my part. But how would I go about removing the duplicate rows then?
@ShawnTheMaroon47, event_list = event_list.drop_duplicates()
