I'm scraping data and I need to save it at each step, in order to avoid losing what I have already done. My code is similar to this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
from random import randrange

def crawl(df):
    chrome_options = webdriver.ChromeOptions()
    my_list1 = []
    my_list2 = []
    # Server info
    query = df['Source'].unique().tolist()
    driver = webdriver.Chrome('path', chrome_options=chrome_options)
    driver.maximize_window()
    for x in query:
        response = driver.get('link_to_scrape/' + x)
        try:
            wait = WebDriverWait(driver, 30)
            time.sleep(randrange(5))
            driver.execute_script("window.scrollTo(0, 1000)")
            # Get data to append in my_list1
            my1 = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Trustscore']/../following-sibling::div/descendant::div[@class='icon']"))).text
            my_list1.append(my1)
            # Get data to append in my_list2
            try:
                my2 = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Alexa rank']/../following-sibling::div"))).text
                my_list2.append(my2)
            except:
                my_list2.append("Data not available")
        except:
            print("\n!!! ERROR !!!")
            break
    # Create dataframe
    dict = {'Source': query, 'List 1': my_list1, 'List 2': my_list2}
    df = pd.DataFrame.from_dict(dict)
    driver.quit()
    return df
Currently, the code has a weakness that I'd like to fix by saving the data before closing the session for each element in the query.
Let's say that I have 5 elements in df['Source']: x1, x2, x3, x4, x5.
When I run my code, x1 is saved, but when the code runs on x2 I get the error ValueError: arrays must all be the same length and the process stops. I'd like to fix this issue as follows:
- for each unique element in df['Source'], open Chrome, extract the data, save it into a data frame, then close the Chrome window (a rough skeleton of this loop is sketched after this list);
- wait 15 seconds before submitting a new request;
- submit a new request: open Chrome for the second element in df['Source'], extract the data, save it in the same data frame used previously (for element x1), close Chrome;
- and so on, until all the elements are in the new data frame.
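To make the loop I'm after concrete, this is only a rough skeleton of what I have in mind (the actual XPath extraction from my code above would go where the comment is):

for x in query:
    driver = webdriver.Chrome('path', chrome_options=chrome_options)  # open a fresh window per element
    try:
        driver.get('link_to_scrape/' + x)
        # ... the XPath extraction shown above goes here ...
    finally:
        driver.quit()      # always close the window, even if extraction fails
    # ... save the extracted row before moving on ...
    time.sleep(15)         # wait 15 seconds before the next request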
In order to keep the extracted data, I would need the df to be updated at each step, not only at the end, i.e. once crawl has extracted data for every item in the list. My code is not doing that: it creates the df at the end, so every time I get an error I lose my work. In the end, I should have a data frame with 5 rows (excluding the header), containing the extracted data (or the error message, if the exception is hit). Can you help me understand the right way to open/close Chrome and save/update the data frame with new data at each iteration? If you need more info, let me know.
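To show what I mean by "updated at each step", this is the kind of saving I imagine: each iteration appends its row to a CSV, so earlier rows survive a later failure (save_row and results.csv are just names I made up for illustration):

import os
import pandas as pd

def save_row(source, val1, val2, path='results.csv'):
    # Append a single row; write the header only if the file doesn't exist yet
    row = pd.DataFrame([{'Source': source, 'List 1': val1, 'List 2': val2}])
    row.to_csv(path, mode='a', index=False, header=not os.path.exists(path))

The idea would be to call something like save_row(x, my1, my2) right after the two wait.until(...) calls, instead of only building my_list1/my_list2 and creating the data frame at the very end.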