I'm scraping data and I need to save it at each step, in order to avoid losing what I have already done. My code is similar to this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
from random import randrange

def crawl(df):
    chrome_options = webdriver.ChromeOptions()
    my_list1 = []
    my_list2 = []
    # Server info
    query = df['Source'].unique().tolist()
    driver = webdriver.Chrome('path', chrome_options=chrome_options)
    driver.maximize_window()
    for x in query:
        response = driver.get('link_to_scrape/' + x)
        try:
            wait = WebDriverWait(driver, 30)
            time.sleep(randrange(5))
            driver.execute_script("window.scrollTo(0, 1000)")
            # Get data to append in my_list1
            my1 = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Trustscore']/../following-sibling::div/descendant::div[@class='icon']"))).text
            my_list1.append(my1)
            # Get data to append in my_list2
            try:
                my2 = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Alexa rank']/../following-sibling::div"))).text
                my_list2.append(my2)
            except:
                my_list2.append("Data not available")
        except:
            print("\n!!! ERROR !!!")
            break
    # Create dataframe
    dict = {'Source': query, 'List 1': my_list1, 'List 2': my_list2}
    df = pd.DataFrame.from_dict(dict)
    driver.quit()
    return df
Currently, the code has a weakness that I'd like to fix by saving the data before closing the session for each element in the query.
Let's say that I have 5 elements in df['Source']: x1, x2, x3, x4, x5.
When I run my code, x1 is saved, but when the code runs on x2 I get the error ValueError: arrays must all be the same length and the process stops. I'd like to fix this issue as follows:
- for each unique element in df['Source'], open Chrome, extract the data, save it into a data frame, then close the Chrome window (a rough skeleton of this loop is sketched after this list);
- wait 15 seconds before submitting a new request;
- submit a new request: open Chrome for the second element in df['Source'], extract the data, save it in the same data frame used previously (for element x1), close Chrome;
- and so on, until all the elements are in the new data frame.
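To make the loop I'm after concrete, this is only a rough skeleton of what I have in mind (the actual XPath extraction from my code above would go where the comment is):

for x in query:
    driver = webdriver.Chrome('path', chrome_options=chrome_options)  # open a fresh window per element
    try:
        driver.get('link_to_scrape/' + x)
        # ... the XPath extraction shown above goes here ...
    finally:
        driver.quit()      # always close the window, even if extraction fails
    # ... save the extracted row before moving on ...
    time.sleep(15)         # wait 15 seconds before the next request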
In order to keep the extracted data, I would need the df to be updated at each step, not only at the end, i.e. once crawl has extracted data for every item in the list. My code is not doing that: it creates the df at the end, so every time I get an error I lose my work. In the end, I should have a data frame with 5 rows (excluding the header), containing the extracted data (or the error message, if the exception is hit). Can you help me understand the right way to open/close Chrome and save/update the data frame with new data at each iteration? If you need more info, let me know.
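To show what I mean by "updated at each step", this is the kind of saving I imagine: each iteration appends its row to a CSV, so earlier rows survive a later failure (save_row and results.csv are just names I made up for illustration):

import os
import pandas as pd

def save_row(source, val1, val2, path='results.csv'):
    # Append a single row; write the header only if the file doesn't exist yet
    row = pd.DataFrame([{'Source': source, 'List 1': val1, 'List 2': val2}])
    row.to_csv(path, mode='a', index=False, header=not os.path.exists(path))

The idea would be to call something like save_row(x, my1, my2) right after the two wait.until(...) calls, instead of only building my_list1/my_list2 and creating the data frame at the very end.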