I am scraping URLs and storing the complete HTML of each page in a pandas DataFrame, which I then save as a CSV file and clean.
Code 1
import re
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
url_list = ["https://www.flagstaffsymphony.org/event/masterworks-v-saint-saens-and-bruckner/",
"https://www.berlinerfestspiele.de/de/berliner-festspiele/programm/bfs-gesamtprogramm/programmdetail_341787.html",
"https://www.seattlesymphony.org/en/concerttickets/calendar/2021-2022/21bar3"]
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}
driver = webdriver.Chrome('/home/ubuntu/selenium_drivers/chromedriver')
url_data = []
columns = ['url_list','data']
for URL in url_list:
    driver.get(URL)
    driver.implicitly_wait(2)
    data = driver.page_source
    row_data = [URL, data]
    url_data.append(row_data)
html_data = pd.DataFrame(url_data, columns=columns)
html_data["parsedata"] = BeautifulSoup(str(html_data["data"]), "lxml").text
cleanr = re.compile('<.*?>')
html_data["cleandata"] = re.sub(cleanr, '', str(html_data["parsedata"]))
But after this cleaning step, html_data["cleandata"] contains garbage values instead of the cleaned text. When I clean each URL individually, it works. How can I clean this HTML data stored in a pandas DataFrame?
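The garbage values most likely come from `str(html_data["data"])`: calling `str()` on a whole Series returns its truncated display repr (row indices plus elided HTML with "..."), not the raw HTML, so BeautifulSoup parses that repr instead of the pages. A sketch of a per-row fix using `Series.apply`, with a small stand-in DataFrame in place of the scraped pages (the parser is `html.parser` here, but `lxml` as in the question works the same way):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for the scraped DataFrame from the question: one row of raw HTML.
html_data = pd.DataFrame({
    "url_list": ["http://example.com"],
    "data": ["<html><body><p>Hello &amp; welcome</p></body></html>"],
})

# Parse each row's HTML individually instead of stringifying the Series;
# get_text() extracts visible text, so no regex tag-stripping is needed.
html_data["cleandata"] = html_data["data"].apply(
    lambda html: BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
)

print(html_data["cleandata"].iloc[0])  # -> Hello & welcome
```

Because `get_text()` already removes all markup (and decodes entities such as `&amp;`), the separate `re.sub` tag-stripping pass becomes unnecessary.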