Getting xpath indices to count forward to preserve table structure

Question

Scraping tables from the web get complicated when there are 2 or more values in a cell. In order to preserve the table structure, I have devised a way to count the row-number index of its xpath, implementing a nested list when the row number stays the same.

    def get_structured_elements(name):
        """For target data that is nested and structured,
        such as a table with multiple values in a single cell."""
        driver = self.driver
    
        i = 2 # keep track of 'i' to retain the document structure.
        number_of_items ＝ number_of_items_found()
        elements = [None] * number_of_items # len(elements) will exceed number_of_items.

        target_data = driver.find_elements("//table/tbody/tr[" + i + "]/td[2]/a")
    
        while i - 2 < number_of_items:
            for item in target_data:
                # print(item.text, i-1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text # set to item.text value if position is empty. 
                else:
                    elements[i - 2] = [elements[i - 2]] 
                    elements[i - 2].append(item.text) # make nested list and append new value if position is occupied.
            i += 1
    
        return elements

This simple logic was working fine, until I sought to manage all locator variables in one place to make the code more reusable: How do I store this expression "//table/tbody/tr[" + i + "]/td[2]/a" in a list or dictionary so that it still works when plugged in?

The solution (i.e. hack) I came up with is a function that takes in the front and back half of the iterating xpath as arguments, returning front_half + str(i) + back_half if i is part of the parent (iterator) function's local variable.

def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index. 
    The 'else' part is to aviod errors 
    when this function is called outside an indexed environment. """
    
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"), 
         "//table/tbody/tr/td[3]/a[1]"
        ]

def xpath_index_iterator():
    for i in range(10):
        print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))


xpath_index_iterator()

# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a

Problem is, split_xpath_at_i is blind to variables in its immediate environment. What I eventually came up with is to make use of the iterator function's attribute to define the counter i so that the variable can be made available to split_xpath_at_i like so:

def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index. 
    The 'else' part is to aviod errors 
    when this function is called outside an indexed environment. """
    try:
        i = xpath_index_iterator.i
    except:
        pass
    
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"), 
         "//table/tbody/tr/td[3]/a[1]"
        ]
    
def xpath_index_iterator():
    xpath_index_iterator.i = 0
    lst = []
    for xpath_index_iterator.i in range(10):
        print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))

xpath_index_iterator()

# //table/tbody/tr[0]/td[2]/a
# //table/tbody/tr[1]/td[2]/a
# //table/tbody/tr[2]/td[2]/a
# //table/tbody/tr[3]/td[2]/a
# //table/tbody/tr[4]/td[2]/a
# //table/tbody/tr[5]/td[2]/a
# //table/tbody/tr[6]/td[2]/a
# //table/tbody/tr[7]/td[2]/a
# //table/tbody/tr[8]/td[2]/a
# //table/tbody/tr[9]/td[2]/a

The problem gets more complicated when I try to invoke split_xpath_at_i via a locator list:

def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index. 
    The 'else' part is to aviod errors 
    when this function is called outside an indexed environment. """
    try:
        i = xpath_index_iterator.i
    except:
        pass
    
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"), 
         "//table/tbody/tr/td[3]/a[1]"
        ]
    
def xpath_index_iterator():
    xpath_index_iterator.i = 0
    lst = []
    for xpath_index_iterator.i in range(10):
#         print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))
        lst.append(xpath[0])
    return lst

xpath_index_iterator()

# ['//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a']

What would a professional approach to this problem look like?

The Entire Code:

The code below was modified from the Selenium manual.

I've asked a related question over here that concerns the general approach to Page Objects design.

test.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from query import Input
import page

cnki = Input()
driver = cnki.webpage('http://big5.oversea.cnki.net/kns55/')

current_page = page.MainPage(driver)
current_page.submit_search('禮學')
current_page.switch_to_frame()
result = page.SearchResults(driver)

structured = result.get_structured_elements('titles') # I couldn't get this to work.
simple = result.simple_get_structured_elements() # but this works fine.

query.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium import webdriver

class Input:
    """This class provides a wrapper around actual working code."""
    
    # CONSTANTS
    
    URL = None
        
    def __init__(self):
        self.driver = webdriver.Chrome
    
    def webpage(self, url):
        driver = self.driver()
        driver.get(url)
        
        return driver

page.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from element import BasePageElement
from locators import InputLocators, OutputLocators
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait


class SearchTextElement(BasePageElement):
    """This class gets the search text from the specified locator"""

    #The locator for search box where search string is entered
    locator = None


class BasePage:
    """Base class to initialize the base page that will be called from all
    pages"""

    def __init__(self, driver):
        self.driver = driver

class MainPage(BasePage):
    """Home page action methods come here. I.e. Python.org"""

    search_keyword = SearchTextElement()
    
    def submit_search(self, keyword):
        """Submits keyword and triggers the search"""
        SearchTextElement.locator = InputLocators.SEARCH_FIELD
        self.search_keyword = keyword

    def select_dropdown_item(self, item):
        driver = self.driver
        by, val = InputLocators.SEARCH_ATTR
        driver.find_element(by, val + "/option[text()='" + item + "']").click()

    def click_search_button(self):
        driver = self.driver
        element = driver.find_element(*InputLocators.SEARCH_BUTTON)
        element.click()
        
    def switch_to_frame(self):
        """Use this function to get access to hidden elements. """
        driver = self.driver
        driver.switch_to.default_content()
        driver.switch_to.frame('iframeResult')

    # Maximize the number of items on display in the search results.
    def max_content(self):
        driver = self.driver
        max_content = driver.find_element_by_css_selector('#id_grid_display_num > a:nth-child(3)')
        max_content.click()
    
    
    def stop_loading_page_when_element_is_present(self, locator):
        driver = self.driver
        
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(driver, 30, ignored_exceptions=ignored_exceptions)
    
        wait.until(
            EC.presence_of_element_located(locator))
        driver.execute_script("window.stop();")


    def next_page(self):
        driver = self.driver

        self.stop_loading_page_when_element_is_present(InputLocators.NEXT_PAGE)
        driver.execute_script("window.stop();")
    
        try:
            driver.find_element(*InputLocators.NEXT_PAGE).click()
            print("Navigating to Next Page")
        except (TimeoutException, WebDriverException):
            print("Last page reached")
        
        
  
        
class SearchResults(BasePage):
    """Search results page action methods come here"""

    def __init__(self, driver):
        self.driver = driver
        i = None # get_structured_element counter
        
    def wait_for_page_to_load(self):
        driver = self.driver
        wait = WebDriverWait(driver, 100)
        wait.until(
            EC.presence_of_element_located(*InputLocators.MAIN_BODY))
    
    def get_single_element(self, name):
        """Returns a single value as target data."""
        driver = self.driver
        target_data = driver.find_element(*OutputLocators.CNKI[str(name.upper())])
        # SearchTextElement.locator = OutputLocators.CNKI[str(name.upper())]
        # target_data = SearchTextElement()
        return target_data
    
    def number_of_items_found(self):
        """Return the number of items found on a single page."""
        driver = self.driver
        target_data = driver.find_elements(*OutputLocators.CNKI['INDEX'])
        
        return len(target_data)
    
    def get_elements(self, name):
        """Returns simple list of values in specific data field in a table."""
        driver = self.driver
        target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])
        
        elements = []
        for item in target_data:
            elements.append(item.text)
        
        return elements


    def get_structured_elements(self, name):
        """For target data that is nested and structured,
        such as a table with multiple values in a single cell."""
        driver = self.driver

        i = 2 # keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items

        while i - 2 < number_of_items:
            
            target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])

            for item in target_data:
                print(item.text, i - 1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)
                else:
                    elements[i - 2] = [elements[i - 2]]
                    elements[i - 2].append(item.text)
            i += 1
    
        return elements
    
    def simple_get_structured_elements(self):
        """Simple structured elements code with fixed xpath."""
        driver = self.driver

        i = 2 # keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items
        
        while i - 2 < number_of_items:
            target_data = driver.find_elements_by_xpath\
            ('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr['\
                 + str(i) + ']/td[2]/a')

            for item in target_data:
                print(item.text, i-1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)
                else:
                    elements[i - 2] = [elements[i - 2]]
                    elements[i - 2].append(item.text)
            i += 1

        return elements

element.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.support.ui import WebDriverWait


class BasePageElement():
    """Base page class that is initialized on every page object class."""
    
    def __set__(self, obj, value):
        """Sets the text to the value supplied"""
        driver = obj.driver
        
        text_field = WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(*self.locator))
        text_field.clear()
        text_field.send_keys(value)
        text_field.submit()

    def __get__(self, obj, owner):
        """Gets the text of the specified object"""
        driver = obj.driver
        
        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(*self.locator))
        element = driver.find_element(*self.locator)
        return element.get_attribute("value")

locators.py

This is where split_xpath_at_i sits.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
# import page

class InputLocators():
    """A class for main page locators. All main page locators should come here"""
        
    def dropdown_list_xpath(attribute, value):
        string = "//select[@" + attribute + "='" + value + "']"
        
        return string
    
    MAIN_BODY = (By.XPATH, '//GridTableContent/tbody')
    SEARCH_FIELD = (By.NAME, 'txt_1_value1') # (By.ID, 'search-content-box')
    SEARCH_ATTR = (By.XPATH, dropdown_list_xpath('name', 'txt_1_sel'))
    SEARCH_BUTTON = (By.ID, 'btnSearch')
    NEXT_PAGE = (By.LINK_TEXT, "下頁")

class OutputLocators():
    """A class for search results locators. All search results locators should
    come here"""
    
    def split_xpath_at_i(front_half, back_half):
        # try:
        #     i = page.SearchResults.g_s_elem
        # except:
        #     pass

        if 'i' in locals():
            string = front_half + str(i) + back_half
        else:
            string = front_half+"SPLIT_i"+back_half
    
        return string

    CNKI = {
        "TITLES": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[2]/a')),
        "AUTHORS": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[3]/a')),
        "JOURNALS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[4]/a',
        "YEAR_ISSUE": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[5]/a',
        "DOWNLOAD_PATHS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[1]', 
        "INDEX": (By.XPATH, '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]')
    }


    # # Interim Data
    # CAPTIONS = 
    # LINKS = 
    
    # Target Data
    # TITLES = 
    # AUTHORS = 
    # JOURNALS = 
    # VOL = 
    # ISSUE = 
    # DATES = 
    # DOWNLOAD_PATHS =

Can you include all of your scraping code, or at least a representative sample of HTML that you're trying to parse? — Reinderien
– Reinderien, Commented Jun 7, 2021 at 16:55
The whole code consist of a couple .py files. Would it be too much to post them here? — Sati
– Sati, Commented Jun 7, 2021 at 16:58
Nope :) The code length limit is quite high, and for the purposes of a question like this, unless the files are perhaps 1000+ lines each, posting them full-form can only help your question. — Reinderien
– Reinderien, Commented Jun 7, 2021 at 17:08

Reinderien · Accepted Answer · 2021-06-08 14:27:21Z

First: I would typically recommend that you replace your use of Selenium with direct requests calls. If it's possible, it's way more efficient than Selenium. It would look like the following, as a very rough start:

from time import time
from typing import Iterable
from urllib.parse import quote
from requests import Session

def js_encode(u: str) -> Iterable[str]:
    for char in u:
        code = ord(char)
        if code < 128:
            yield quote(char).lower()
        else:
            yield f'%u{code:04x}'


def search(query: str):
    topic = '主题'
    # China Academic Literature Online Publishing Database
    catalog = '中国学术文献网络出版总库'
    databases = (
        '中国期刊全文数据库,'            # China Academic Journals Full-text Database
        '中国博士学位论文全文数据库,'     # China Doctoral Dissertation Full-text Database
        '中国优秀硕士学位论文全文数据库,'  # China Master's Thesis Full-text Database
        '中国重要会议论文全文数据库,'     # China Proceedings of Conference Full-text Database
        '国际会议论文全文数据库,'        # International Proceedings of Conference Full-text Database
        '中国重要报纸全文数据库,'        # China Core Newspapers Full-text Database
        '中国年鉴网络出版总库'          # China Yearbook Full-text Database
    )

    with Session() as session:
        session.headers = {
            'Accept':
                'text/html,'
                'application/xhtml+xml,'
                'application/xml;q=0.9,'
                'image/webp,'
                '*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-CA,en-GB;q=0.8,en;q=0.5,en-US;q=0.3',
            'Cache-Control': 'no-cache',
            'Connection': 'keep-alive',
            'DNT': '1',
            'Host': 'big5.oversea.cnki.net',
            'Pragma': 'no-cache',
            'Sec-GPC': '1',
            'User-Agent':
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) '
                'Gecko/20100101 '
                'Firefox/89.0',
            'Upgrade-Insecure-Requests': '1',
        }

        with session.get(
            'https://big5.oversea.cnki.net/kns55/brief/result.aspx',
            params={
                'txt_1_value1': query,
                'txt_1_sel': topic,
                'dbPrefix': 'SCDB',
                'db_opt': catalog,
                'db_value': databases,
                'search-action': 'brief/result.aspx',
            },
        ) as response:
            response.raise_for_status()
            search_url = response.url
            search_page = response.text

        encoded_query = ''.join(js_encode(',' + query))
        # epoch milliseconds
        timestamp = round(time()*1000)

        # page_params = {
        #     'curpage': 1,
        #     'RecordsPerPage': 20,
        #     'QueryID': 0,
        #     'ID': '',
        #     'turnpage': 1,
        #     'tpagemode': 'L',
        #     'Fields': '',
        #     'DisplayMode': 'listmode',
        #     'sKuaKuID': 0,
        # }

        with session.get(
            'http://big5.oversea.cnki.net/kns55/brief/brief.aspx',
            params={
                'pagename': 'ASP.brief_result_aspx',
                'dbPrefix': 'SCDB',
                'dbCatalog': catalog,
                'ConfigFile': 'SCDB.xml',
                'research': 'off',
                't': timestamp,
            },
            cookies={
                'FileNameS': quote('cnki:'),
                'KNS_DisplayModel': '',
                'CurTop10KeyWord': encoded_query,
                'RsPerPage': '20',
            },
            headers={
                'Referer': search_url,
            }
        ) as response:
            response.raise_for_status()
            results_iframe = response.text


def main():
    etiquette = '禮學'
    search(query=etiquette)


if __name__ == '__main__':
    main()

Unfortunately, the design of this website is violently awful. State is passed around using a mix of query parameters, cookies, and server-only context that you can't see and relies on request history in a non-trivial way. So even though the above produces, to my knowledge, identical parameters, headers and cookies to those that you see in real life on the website, there's a failure where a couple of dynamically-generated <script> sections in brief.aspx are silently omitted. So I'm giving up on this recommendation.

Shifting gears:

The following recommendations are going to cover scope and class usage, and these should get you toward sanity:

The code in test.py needs to be moved into a function
Only test.py should have a shebang and none of your other files, since only test.py is a meaningful entry point.
Is Input.URL ever used? That probably needs to be deleted
Input.webpage should not be returning anything; driver is already a member on the class.
Input as a whole is suspect. It provides such a thin wrapper around driver as to be basically useless on its own. I would expect the driver.get() to be moved to MainPage.__init__.

InputLocators also does not deserve to be a class. Those constants can basically be distributed to the point of use, i.e.

  wait.until(
      EC.presence_of_element_located(
          By.XPATH, 
          '//GridTableContent/tbody',
      )
  )

Your search_keyword is strange - you start off initializing it as a static, and then change to using it as an instance variable in submit_search. Why? Also, what is keyword? You would benefit from using PEP484 type hints.

switch_to_frame has timing issues and did not work for me at all until I added two waits:

  WebDriverWait(driver, 100).until(
      lambda driver: driver.find_element(
          By.XPATH,
          '//iframe[@name="iframeResult"]',
      ))
  driver.switch_to.frame('iframeResult')

  WebDriverWait(driver, 100).until(
      lambda driver: driver.find_element(
          By.XPATH,
          '//table[@class="GridTableContent"]',
      ))

Your () at the end of base-less classes can be dropped
OutputLocators.CNKI is a dictionary. Why? get_single_element indexes into it, but get_single_element is itself never called.

This code:

    elements = []
    for item in target_data:
        elements.append(item.text)
    return elements

can be replaced with a generator:

for item in target_data:
    yield item.text

This code:

    i = None # get_structured_element counter

does nothing since all local variables are discarded at end of scope.

This code:

    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

is never going to see its first branch evaluated, since i is not defined locally. I really don't know what you intended here.

These long xpath tree traversals, such as

'//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]'

are both fragile and difficult to read. In most cases you should be able to condense them by a mix of inner // to omit parts of the path, and judicious references to known attributes.

You specifically ask about

split_xpath_at_i is blind to variables in its immediate environment

If by "its immediate environment" you mean CNKI (etc) that's because its immediate environment, the class static scope, has not yet been initialized. CNKI can get a reference to it but not the opposite. If you want this to have some kind of state like a counter, then it needs to be promoted to an instance method with a self parameter. I don't know how g_s_elem factors into this because it's not defined anywhere.

You ask:

The SearchTextElement class with just one locator variable that is hardcoded - is that a good approach?

Not really. First of all, you've again conflated static and instance variables, because you first initialize a static variable to None and then write an instance variable after construction. Why construct a class at all, if it only holds one member and has no methods?

I've tried the requests call method before, too. It got stuck after submitting the search query. Selenium provides the driver.switch_to.frame('iframeResult') to get past that stage and provide access to search result elements. I wonder if there is an equivalent in the requests world. What you are describing seems like a different issue altogether. — Sati
– Sati, Commented Jun 8, 2021 at 1:30
I see that you have results_iframe = response.text in your sample code as well. — Sati
– Sati, Commented Jun 8, 2021 at 1:42
I suspect the "violently awful" design is a deliberate feature to discourage scraping. — Sati
– Sati, Commented Jun 8, 2021 at 1:45
There are better ways to discourage scraping - I honestly think this is just a product of bad design. For fun, read some of the JS code. It's legacy 12-year-old ASP with a big, crazy mishmash of iframes, AJAX, commented-out code blocks, and developer notes. Bonus points if you can find the comment a developer left confessing to a bad, bad hack. In short, never attribute to malice that which you can attribute to ignorance. — Reinderien
– Reinderien, Commented Jun 8, 2021 at 1:50
@Sati Updating your answer is discouraged. However, posting a new question with your incorporated changes are more than welcome — N3buchadnezzar
– N3buchadnezzar, Commented Jun 8, 2021 at 4:27

Stack Exchange Network

Getting xpath indices to count forward to preserve table structure

The Entire Code:

test.py

query.py

page.py

element.py

locators.py

1 Answer 1

You must log in to answer this question.

Linked

Hot Network Questions

Getting xpath indices to count forward to preserve table structure

The Entire Code:

test.py

query.py

page.py

element.py

locators.py

1 Answer 1

You must log in to answer this question.

Linked

Related

Hot Network Questions