
I am scraping a list of URLs and storing the complete HTML of each page in a pandas DataFrame, so I can save it as a CSV file and clean it.

Code 1

import re
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

url_list = ["https://www.flagstaffsymphony.org/event/masterworks-v-saint-saens-and-bruckner/",
"https://www.berlinerfestspiele.de/de/berliner-festspiele/programm/bfs-gesamtprogramm/programmdetail_341787.html",
"https://www.seattlesymphony.org/en/concerttickets/calendar/2021-2022/21bar3"]


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}

driver = webdriver.Chrome('/home/ubuntu/selenium_drivers/chromedriver')
url_data = []
for URL in url_list:
    driver.get(URL)
    driver.implicitly_wait(2)
    data = driver.page_source
    row_data = [URL,data]
    url_data.append(row_data)
html_data = pd.DataFrame(url_data, columns = ['urllist', 'data'])
html_data["parsedata"] = BeautifulSoup(str(html_data["data"]), "lxml").text
cleanr = re.compile('<.*?>')
html_data["cleandata"] = re.sub(cleanr, '', str(html_data["parsedata"]))

But after cleaning, html_data["cleandata"] contains garbage values instead of the cleaned text. When I clean each URL individually, it works. How can I clean this HTML data while it is stored in a pandas DataFrame?

1 Answer


The BeautifulSoup parser works on a text string, but BeautifulSoup(str(html_data["data"]), ...) passes it the string representation of an entire pandas Series. That representation includes the index labels and truncates long values, so the parser never sees the real HTML, which is why you get garbage output.
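To see why, here is a minimal sketch with made-up data (the variable s is hypothetical): str() on a Series returns its repr, which prefixes index labels and cuts long values short.

import pandas as pd

# str() on a Series yields its repr: index labels plus truncated values,
# so BeautifulSoup never receives the full HTML.
s = pd.Series(["<html><body>" + "x" * 200 + "</body></html>"])
print(str(s))
# 0    <html><body>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
# dtype: object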

The fix is to apply the parsing and cleaning row-wise, so each page's HTML is processed individually:

html_data = pd.DataFrame(url_data, columns = ['urllist', 'data'])
html_data["parsedata"] =  html_data.data.apply(lambda x: BeautifulSoup(x, "lxml").text)
cleanr = re.compile('<.*?>')
html_data["cleandata"] = html_data.parsedata.apply(lambda x: re.sub(cleanr, '', x))

Also, I would recommend parsing and cleaning each page before appending it to url_data, so the dataframe html_data is built from already-clean text:

cleanr = re.compile('<.*?>')
url_data = []
for URL in url_list:
    driver.get(URL)
    driver.implicitly_wait(2)
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    cleaned_data = re.sub(cleanr, '', soup.text)
    url_data.append([URL, cleaned_data])

html_data = pd.DataFrame(url_data, columns = ['urllist', 'cleandata'])
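
Either way, since the stated goal was to store the result as a CSV file, the cleaned frame can then be written out (the file name here is just a placeholder):

# index=False drops pandas' integer row index from the output file.
html_data.to_csv("cleaned_pages.csv", index=False)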