I am trying to extract the tables generated by selecting "Branches", a city and a district from this site: https://www.acb.com.vn/wps/portal/en/atm

So far, I have written code that iterates over each city and district:

from selenium.webdriver.support.ui import Select
from selenium.webdriver import Chrome
import pandas as pd
import time

webdriver = "chromedriver.exe"

driver = Chrome(webdriver)
driver.get('https://www.acb.com.vn/wps/portal/en/atm')
    
branch_selector = driver.find_element_by_xpath('//*[@id="branch"]')
branch_selector.click()

city = Select(driver.find_element_by_id('cityId'))

for i in range(len(city.options)):
    city.select_by_index(i)
    time.sleep(1)
    
    district = Select(driver.find_element_by_id('districtId'))

    for j in range(len(district.options)):
        district.select_by_index(j)
        time.sleep(1)

        try:
            find_btn = driver.find_element_by_xpath('//*[@id="frm-filter"]/div[3]/a[1]')
            find_btn.click()
            time.sleep(1)
            
        except:
            close_btn = driver.find_element_by_xpath('//*[@id="close-send-email"]/span[2]')
            close_btn.click()
            time.sleep(1)

Now, I want to extract the table that is displayed in each iteration of the two loops. However, the HTML for the table does not use the "table" tag; it is built from nested div elements (the full HTML is in the EDIT at the end of this question).

So, how do I extract the table for each city-district pair?

I tried the following:

    try:
        click_btn = driver.find_element_by_xpath('//*[@id="frm-filter"]/div[3]/a[1]')
        click_btn.click()
        time.sleep(1)
        
        table = driver.find_elements_by_class_name('tbody')
        for table_row in table:
            row = table_row.find_elements_by_class_name('row')
            print ([r.text for r in row])
        
    except:
        close_btn = driver.find_element_by_xpath('//*[@id="close-send-email"]/span[2]')
        close_btn.click()
        time.sleep(1)

But it prints a list of empty strings for each city-district pair, with one entry per address in the table for that pair:

['', '', '', '']
['', '', '', '']
['', '', '', '']
['', '', '', '']
['', '', '', '']
['', '', '', '']
['', '', '', '', '']
['', '', '', '', '']
['', '', '', '', '']
['', '', '', '', '']
['', '', '', '', '']
['', '', '', '', '']
['', '', '', '']
['', '']
['', '']

I also tried to access each element in each row of the table individually:

    try:
        find_btn = driver.find_element_by_xpath('//*[@id="frm-filter"]/div[3]/a[1]')
        find_btn.click()
        time.sleep(1)
        
        table = driver.find_elements_by_class_name('tbody')
        for table_row in table:
            row = table_row.find_elements_by_class_name('row')
            
            for element in row:
                time.sleep(1)
                
                Type.append(element.find_element_by_class_name('col type'))
                Address.append(element.find_element_by_class_name('col address'))
                District.append(element.find_element_by_class_name('col district'))
                Tel_Fax.append(element.find_element_by_class_name('col tel-fax'))
                Hours.append(element.find_element_by_class_name('col hours'))
        
    except:
        close_btn = driver.find_element_by_xpath('//*[@id="close-send-email"]/span[2]')
        close_btn.click()
        time.sleep(1)

But this gives the following error:

---------------------------------------------------------------------------
NoSuchElementException                    Traceback (most recent call last)
<ipython-input-41-2d73f0dc931c> in <module>
     39 
---> 40                     Type.append(element.find_element_by_class_name('col type'))
     41                     Address.append(element.find_element_by_class_name('col address'))

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="col type"]"}

Since it says css selector in the error, I tried the following:

element.find_element_by_css_selector('div.col.type').text

This outputs a blank string, ''.

So, how do I do this?

EDIT: The HTML of the table, for one district-city selection, is:

        <div class="tbody">
            
            <div class="row" id="row1">
                <div class="col stt">1</div>
                <div class="col type">
                
                PGD Hai Bà Trưng</div>
                <div class="col address">56-58-60 Hai Bà Trưng, P. Bến Nghé, Quan 1, Ho Chi Minh</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 6291 3690<br>(028) 6291 3691</div>
                <div class="col hours">      07:00-16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('56-58-60 Hai Bà Trưng, P. Bến Nghé, Quan 1, Ho Chi Minh', '10.77714,106.704325', 1); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row2">
                <div class="col stt">2</div>
                <div class="col type">
                
                PGD Đa Kao</div>
                <div class="col address">45 Võ Thị Sáu, P. Đa Kao, Quan 1, Ho Chi Minh</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 6290 5980<br>(028) 6290 5981</div>
                <div class="col hours">  07:30 – 16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('45 Võ Thị Sáu, P. Đa Kao, Quan 1, Ho Chi Minh', '10.790715,106.69486', 2); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row3">
                <div class="col stt">3</div>
                <div class="col type">
                
                PGD Nguyễn Công Trứ</div>
                <div class="col address">74 - 76 Nguyễn Công Trứ, P. Nguyễn Thái Bình, Quan 1, Ho Chi Minh</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 3914 4470 <br>(028) 3914 4471</div>
                <div class="col hours">  07:30 – 16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('74 - 76 Nguyễn Công Trứ, P. Nguyễn Thái Bình, Quan 1, Ho Chi Minh', '10.76972,106.703142', 3); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row4">
                <div class="col stt">4</div>
                <div class="col type">
                
                PGD Lê Lợi</div>
                <div class="col address">72 Lê Lợi, P. Bến Thành, Quận 1, TP.Hồ Chí Minh</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 3821 4619<br>(028) 3821 4618</div>
                <div class="col hours">    07:00-16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('72 Lê Lợi, P. Bến Thành, Quận 1, TP.Hồ Chí Minh', '10.773541,106.699635', 4); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row5">
                <div class="col stt">5</div>
                <div class="col type">
                
                CN Sài Gòn</div>
                <div class="col address">41 Mạc Đỉnh Chi, P. Đakao, Quan 1, Ho Chi Minh</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 3824 3770<br>(028) 3824 3946</div>
                <div class="col hours">  07:30 – 16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('41 Mạc Đỉnh Chi, P. Đakao, Quan 1, Ho Chi Minh', '10.786191,106.697818', 5); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row6">
                <div class="col stt">6</div>
                <div class="col type">
                
                PGD Nguyễn Thái Bình</div>
                <div class="col address">176 – 178 Ký Con, P. Nguyễn Thái Bình, Quan 1, Ho Chi Minh</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 3915 1310<br>(028) 3915 1311</div>
                <div class="col hours">  07:30 – 16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('176 – 178 Ký Con, P. Nguyễn Thái Bình, Quan 1, Ho Chi Minh', '10.768917,106.696863', 6); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row7">
                <div class="col stt">7</div>
                <div class="col type">
                
                PGD Bến Chương Dương</div>
                <div class="col address">328 Võ Văn Kiệt, phường Cô Giang, Quận 1, Tp.HCM</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 3837 0586<br>(028) 3837 0584</div>
                <div class="col hours">   7h30-16h30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('328 Võ Văn Kiệt, phường Cô Giang, Quận 1, Tp.HCM', '10.76161,106.695998', 7); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row8">
                <div class="col stt">8</div>
                <div class="col type">
                
                PGD Trần Khắc Chân</div>
                <div class="col address">48-50 Nguyễn Hữu Cầu, P.Tân Định, Q.1, TP.HCM</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 3820 9990<br>(028) 3526 7738</div>
                <div class="col hours"> 07:30 -16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('48-50 Nguyễn Hữu Cầu, P.Tân Định, Q.1, TP.HCM', '10.790724, 106.690976', 8); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row9">
                <div class="col stt">9</div>
                <div class="col type">
                
                PGD Cống Quỳnh</div>
                <div class="col address">106  108 Cống Quỳnh, P. Nguyễn Cư Trinh, Q.1</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 38385464<br>(028) 3925 6645</div>
                <div class="col hours"> 07:30 -16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('106  108 Cống Quỳnh, P. Nguyễn Cư Trinh, Q.1', '10.764772,106.687505', 9); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row10">
                <div class="col stt">10</div>
                <div class="col type">
                
                CN Bến Thành</div>
                <div class="col address">96 Lý Tự Trọng, P. Bến Thành, Quan 1, Ho Chi Minh</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 3825 7949<br>(028) 3825 7950</div>
                <div class="col hours"> 07:30-16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('96 Lý Tự Trọng, P. Bến Thành, Quan 1, Ho Chi Minh', '10.774379, 106.697395', 10); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row11">
                <div class="col stt">11</div>
                <div class="col type">
                
                PGD Tân Định   </div>
                <div class="col address">261 Trần Quang Khải, Phường Tân Định, Quận 1, TP.HCM</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 3848 0520<br></div>
                <div class="col hours"> 07:30 - 16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('261 Trần Quang Khải, Phường Tân Định, Quận 1, TP.HCM', '10.791284, 106.688080', 11); return false;">Direction</a></div>
            </div>
            
            <div class="row" id="row12">
                <div class="col stt">12</div>
                <div class="col type">
                
                PGD Nguyễn Du</div>
                <div class="col address">Tầng hầm 1, tầng trệt, tầng lửng và tầng 2 tòa nhà 480 đường Nguyễn Thị Minh Khai, Phường 2, Quận 3, TP.Hồ Chí Minh</div>
                <div class="col district">1</div>
                <div class="col tel-fax">(028) 35218626<br>(028) 35218627</div>
                <div class="col hours"> 07:30 -16:30</div>
                <div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('Tầng hầm 1, tầng trệt, tầng lửng và tầng 2 tòa nhà 480 đường Nguyễn Thị Minh Khai, Phường 2, Quận 3, TP.Hồ Chí Minh', '10.777328,106.698459', 12); return false;">Direction</a></div>
            </div>
            
        </div>
  • There are no td tags anywhere for Selenium to find. The address text of each branch is contained within a <span class="address">. Commented Jul 10, 2020 at 14:34
  • Please copy and paste the HTML code. You can right-click on a tag on the DOM inspector and copy it. Commented Jul 10, 2020 at 16:27
  • @PaulM. Edited the question with what I believe should be the correct way to parse the table. Commented Jul 11, 2020 at 4:38
  • @GregBurghardt Added the HTML code. Commented Jul 11, 2020 at 4:40

1 Answer

Analysing the website shows that it makes a POST request when the form is submitted. The function on the site that does this is as follows:

function findMap() {        
        var keyWord =document.getElementById("keyWord").value;
        var cityId =document.getElementById("cityId").value;
        var districtId =document.getElementById("districtId").value;
        var isCheckBranch = document.getElementById("branch").checked;
        var isCheckAtm = document.getElementById("atm").checked;
        var isCheckWestern = document.getElementById("western").checked;
        var isCheckCdm = document.getElementById("cdm").checked;
        var branch="";
        var atm="";
        var western="";
        var cdm="";
        
        var input = document.getElementById ("keyWord");
        var placeholder = input.placeholder;
        if( keyWord == placeholder ){
            keyWord = "";
        }
        
        if((!isCheckBranch) && (!isCheckAtm) && (!isCheckWestern) && (!isCheckCdm)){
            showMessage('Please select Branch or ATM or Western Union or CDM.', 'branch');
            return;
        }
        
        if((!districtId || 0 === districtId.length) && (!keyWord || 0 === keyWord)){
            showMessage('Please select the province or enter the address.', 'keyWord');
            return;
        }
        if(isCheckBranch){
            branch = "branch";
        }
        if(isCheckAtm){
            atm = "atm";
        }
        if(isCheckWestern){
            western = "western";
        }
        if(isCheckCdm){
            cdm = "cdm";
        }
        var url = '/ACBMapPortlet/en/Process.jsp';
        var urlPattern = 'https://www.acb.com.vn:443/ACBMapPortlet/en/MapMobi.jsp';
        $( "#resultSearch" ).load( url, { "params[]": [ "Search", branch, atm, western, cdm, districtId, keyWord, cityId, latlng, urlPattern]} );
        
    }

This shows what happens when you click the submit button: the selected values are sent as form data, in a params[] array, to /ACBMapPortlet/en/Process.jsp, and the response is loaded into #resultSearch.

The same request can be made directly from Python. In the example below, cityId = 18 and districtId = 187 (district IDs are populated through an AJAX call):

import requests
from bs4 import BeautifulSoup
import pandas as pd

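# Same order as the params[] array built in findMap():
# command, branch, atm, western, cdm, districtId, keyWord, cityId, latlng, urlPattern
params = ["Search", "branch", "atm", "western", "cdm", 187, "", 18, 0,
          "https://www.acb.com.vn:443/ACBMapPortlet/en/MapMobi.jsp"]
res = requests.post("https://www.acb.com.vn/ACBMapPortlet/en/Process.jsp", data={"params[]": params})

# Remove newlines and tabs so the cell text comes out clean
result = res.text.replace("\n", "").replace("\t", "").replace("\r", "")

soup = BeautifulSoup(result, "lxml")
# [:-1] drops the last column of each row (the "Direction" link in the HTML from the question)
headers = [i.text.strip() for i in soup.find("div", class_="thead").find_all("div", class_="col")[:-1]]
body = [[j.text.strip() for j in i.find_all("div", class_="col")[:-1]] for i in soup.find("div", class_="tbody").find_all("div", class_="row")]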

df = pd.DataFrame(body, columns=headers)

print(df)

df.to_csv("data.csv", index=False)

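The parsing step above can be checked offline, without calling the site. The following is a minimal sketch that applies the same tbody/row/col lookup to two rows copied from the HTML in the question. The helper name parse_rows is hypothetical, and it uses Python's built-in "html.parser" instead of "lxml" only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Two rows copied from the HTML posted in the question (onclick handlers omitted).
SAMPLE_HTML = """
<div class="tbody">
  <div class="row" id="row1">
    <div class="col stt">1</div>
    <div class="col type">

    PGD Hai Bà Trưng</div>
    <div class="col address">56-58-60 Hai Bà Trưng, P. Bến Nghé, Quan 1, Ho Chi Minh</div>
    <div class="col district">1</div>
    <div class="col tel-fax">(028) 6291 3690<br>(028) 6291 3691</div>
    <div class="col hours">      07:00-16:30</div>
    <div class="col control"><a href="#" class="btn-direction">Direction</a></div>
  </div>
  <div class="row" id="row11">
    <div class="col stt">11</div>
    <div class="col type">

    PGD Tân Định   </div>
    <div class="col address">261 Trần Quang Khải, Phường Tân Định, Quận 1, TP.HCM</div>
    <div class="col district">1</div>
    <div class="col tel-fax">(028) 3848 0520<br></div>
    <div class="col hours"> 07:30 - 16:30</div>
    <div class="col control"><a href="#" class="btn-direction">Direction</a></div>
  </div>
</div>
"""


def parse_rows(html):
    """Return one list of cell strings per div.row inside div.tbody."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("div", class_="tbody")
    if tbody is None:
        return []  # no result table in this response
    rows = []
    for row in tbody.find_all("div", class_="row"):
        cells = row.find_all("div", class_="col")[:-1]  # drop the "Direction" column
        # Join text around <br> with a space and trim surrounding whitespace
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


for parsed in parse_rows(SAMPLE_HTML):
    print(parsed)
```

Returning an empty list when no div.tbody is present is a design choice for this sketch; it is meant for city-district pairs that have no results, which would otherwise raise an AttributeError in the one-line version.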

Update 1:

City IDs: these are hard-coded in the page. Each one is the value attribute of an option in the cityId select element.
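As a rough sketch of reading those values, the snippet below pulls (city_id, city_name) pairs out of a cityId select element. It assumes you already have the page markup as a string (for example from driver.page_source in the Selenium session from the question). The helper name city_options is hypothetical, and the sample markup is made up for illustration, so the real option names and values will differ.

```python
from bs4 import BeautifulSoup


def city_options(page_html):
    """Return (city_id, city_name) pairs from the <select id="cityId"> element."""
    soup = BeautifulSoup(page_html, "html.parser")
    select = soup.find("select", id="cityId")
    if select is None:
        return []  # the select was not present in this markup
    pairs = []
    for opt in select.find_all("option"):
        value = opt.get("value", "").strip()
        if value:  # skip the empty placeholder option
            pairs.append((value, opt.get_text(strip=True)))
    return pairs


# Made-up markup for illustration only; it is not copied from the site.
SAMPLE = """
<select id="cityId">
  <option value="">City</option>
  <option value="3">Example City A</option>
  <option value="18">Example City B</option>
</select>
"""

print(city_options(SAMPLE))
```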

District IDs: to get these, the website makes an AJAX call:

function getDistrict(cityId) {
        var url = '/ACBMapPortlet/en/DistrictSelectBox.jsp';
        $.post( url, { cmd:'DISTRICT', cityId:cityId}, function(data) {
            var content = $( data );
            $("#divDistrict span").empty().append("District");
            $("#iconselect").empty();
            $("#districtId").empty().append(content);               
         });     
    }

How to get all districts given a cityId?

def districts_names(cityid):
    # Same form fields that getDistrict() posts
    data = {"cmd": "DISTRICT", "cityId": cityid}
    res = requests.post("https://www.acb.com.vn/ACBMapPortlet/en/DistrictSelectBox.jsp", data=data)
    soup = BeautifulSoup(res.text, "lxml")
    # Each <option> carries the district ID in its value attribute
    return [(i["value"].strip(), i.text.strip()) for i in soup.find_all("option")]

Example: districts_names(3) returns the following:

[('', 'District'),
 ('234', 'Ba Bể'),
 ('235', 'Bạch Thông'),
 ('236', 'Chợ Đồn'),
 ('237', 'Chợ Mới'),
 ('238', 'Na Rì'),
 ('239', 'Ngân Sơn'),
 ('240', 'Bắc Kạn')]

Each tuple has the form (district_id, district_name).
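Putting the two requests together, a loop over every district of a city could look roughly like the sketch below. The names build_search_params and search_html are hypothetical helpers, not part of the site or of any library. The list order follows the params[] array in findMap(), unchecked boxes are sent as empty strings as in that function, and the 0 in the latlng position is copied from the request shown earlier in this answer. Only "branch" is enabled by default, since the question asks about branches.

```python
def build_search_params(city_id, district_id, branch=True, atm=False,
                        western=False, cdm=False, keyword=""):
    """Build the params[] list in the order findMap() uses."""
    return [
        "Search",
        "branch" if branch else "",
        "atm" if atm else "",
        "western" if western else "",
        "cdm" if cdm else "",
        district_id,
        keyword,
        city_id,
        0,  # latlng position; 0 is the value used in the request earlier in this answer
        "https://www.acb.com.vn:443/ACBMapPortlet/en/MapMobi.jsp",
    ]


def search_html(city_id, district_id, **flags):
    """POST one search and return the raw HTML (performs a network request)."""
    import requests  # imported here so build_search_params has no dependencies

    res = requests.post(
        "https://www.acb.com.vn/ACBMapPortlet/en/Process.jsp",
        data={"params[]": build_search_params(city_id, district_id, **flags)},
    )
    res.raise_for_status()
    return res.text


# Example loop (not run here), reusing districts_names() defined earlier:
# for district_id, name in districts_names(18):
#     if district_id:  # skip the ('', 'District') placeholder entry
#         html = search_html(18, district_id)
```

The returned HTML can then be parsed with the same thead/tbody code used for the single request. Adding a short pause between requests is a reasonable courtesy to the server.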

10 Comments

I see. So, is there no way to extract info from the table in the way I was trying? Also, in your way of doing it, how do I find the list of city IDs and district IDs so that I can loop through them to extract the info for all cities and districts? And lastly, I went to the website, went to city number 18 on the dropdown and to the district with the corresponding name from the table your code generates, and the website only displays 1 row in the table for this city-district pair, which is the second row in the table your code generates. So, where are the rest of the rows coming from? Thanks!
@Kristada673 When you choose all - branch, atms, western unions and cdms you will get nine results. I have shown in the screenshot
Ah, I see; I was selecting only "Branches" in the website. Lot of things are not clear to me, as this way of scraping is pretty new to me. For starters, I don't even know where these 2 URLs are coming from in your code: https://www.acb.com.vn/ACBMapPortlet/en/Process.jsp and https://www.acb.com.vn:443/ACBMapPortlet/en/MapMobi.jsp. And also, if you could mention where do I get the list of city numbers and district numbers, that would be a big help. Thanks!
@Kristada673 Updated the answer, I would suggest to put a bounty on this question. As this took a bit of work
After putting the bounty, award the bounty to this answer if you feel this answer has solved your problem