
I need to extract a table from the website "https://geniusimpex.org/pakistan-import-data/", which has thousands of rows, so I wanted to automate the process using bs4 and selenium. But when I extract the table, only the table header is extracted. This is the code I used:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://geniusimpex.org/pakistan-import-data/"

# fetch the raw HTML and parse it
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

type(soup)         # quick check of the parsed object in the notebook
soup.prettify()    # and a look at the parsed HTML

# only the header <tr> is returned here
print(soup.find_all('tr'))

It shows the following output (screenshot: https://i.sstatic.net/GItzv.png).

As you can see, only the first row is extracted. Can someone please tell me why I am not able to extract the table and how I can do so? It would be very helpful. Sorry if I am not clear or couldn't explain my problem; this is the first time I am asking a question on Stack Overflow.

  • This could be all manner of things. Help us to help you by posting the relevant code to your problem. Commented Aug 17, 2020 at 11:39
  • My suspicion is that you're using bs4 to try and extract the table. If the table is rendered by JavaScript then that won't be possible, since the data won't have had time to load. Try using selenium to extract this part of the table (see the sketch after these comments). Commented Aug 17, 2020 at 11:42
  • I think you are right but I also used selenium and the result was the same. It could not detect any other row except for the first one. Commented Aug 17, 2020 at 11:59
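For reference, if the rows really are injected by JavaScript as the second comment suggests, urlopen only ever sees the initial HTML, so Selenium has to wait for the rows to render before reading them. A minimal sketch, assuming the data lands in an ordinary table body; the CSS selector "table tbody tr" and the 20-second timeout are guesses about the page's markup, not verified against the site:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://geniusimpex.org/pakistan-import-data/")

    # wait until at least one data row (beyond the header) has been rendered
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
    )

    # read whatever rows are currently in the DOM
    rows = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")
    for row in rows:
        print(row.text)
finally:
    driver.quit()

As the accepted answer below shows, hitting the JSON endpoint directly avoids the browser entirely and is simpler for thousands of rows.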

1 Answer


The data is loaded as JSON from an external URL. You can use this script to load the information:

import requests


# the page loads its table from this admin-ajax endpoint, 1000 rows per request
url = 'https://geniusimpex.org/wp-admin/admin-ajax.php?action=ge_forecast_list_data&order=asc&offset={offset}&limit=1000'

offset = 0
while True:
    data = requests.get(url.format(offset=offset)).json()

    # print data to screen:
    for row in data.get('rows', []):
        for k, v in row.items():
            print('{:<30} {}'.format(k, v))
        print('-' * 80)

    # fewer than 1000 rows returned means we've reached the last page
    if len(data.get('rows', [])) != 1000:
        break

    offset += 1000

Prints:

...

--------------------------------------------------------------------------------
count                          T
importer_name                  <span file_id="27893" post_count="T" post_id="2157293">BISMILLAH STEEL FURNACE \n NEAR GRID STATION DEEWAN</span>
goods_description              IRON AND STEEL REMELTABLE SCRAP HARMONIZED CODE: 7204.4990 REFERENCE NUMBER:UM/PAK/5146A ITN: X20200629019843 NWT WEIGHT-19.650 MT SHIPPERS LOAD, STOWAGE AND COUNT
hs_code                        
shipment_port                   NEWARK  APT/NEW 
gross_weight                    19.65 
number_of_packages              1 
unit_of_packages                PACKAGES 
size_of_container               1 X 20FT 
imported_from_name             SEALINK INTERNATIONAL INC C/O\n UNIVERSAL METALS, ,
bill_of_lading_number           SII145321 
bill_of_lading_date            <span data="10-08-2020">10-08-2020</span>
--------------------------------------------------------------------------------
count                          T
importer_name                  <span file_id="27938" post_count="T" post_id="2159597">ASAD SHAHZAD S/O FAQIR ZADA</span>
goods_description              1 USED VEHICLE TOYOTA VITZ CHASSIS NO: KSP130 -2204837
hs_code                        NA
shipment_port                   NAGOYA,  AICHI 
gross_weight                    .97 
number_of_packages              1 
unit_of_packages                UNIT 
size_of_container               1 X 40FT 
imported_from_name             KASHMIR MOTORS , 3055-9-104 KUZUTSUKA NIIGATA KITA
bill_of_lading_number           TA200716H06- 10 
bill_of_lading_date            <span data="10-08-2020">10-08-2020</span>
--------------------------------------------------------------------------------


...
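Note that some fields in the JSON (importer_name and bill_of_lading_date above) still carry HTML <span> wrappers. A minimal clean-up sketch, assuming you want plain text; this step is not part of the original answer, and the sample row below is made up to mirror the output above:

from bs4 import BeautifulSoup

def clean(value):
    # run string fields through BeautifulSoup to drop any HTML tags
    if isinstance(value, str):
        return BeautifulSoup(value, 'lxml').get_text(strip=True)
    return value

# stand-in row mimicking the printed output
row = {
    'importer_name': '<span file_id="27893">BISMILLAH STEEL FURNACE</span>',
    'bill_of_lading_date': '<span data="10-08-2020">10-08-2020</span>',
}

cleaned = {k: clean(v) for k, v in row.items()}
print(cleaned)

You could apply the same clean() to each row before printing or saving it.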

EDIT: To save to CSV, you can use this script:

import requests
import pandas as pd


url = 'https://geniusimpex.org/wp-admin/admin-ajax.php?action=ge_forecast_list_data&order=asc&offset={offset}&limit=1000'

offset = 0
all_data = []
while True:
    data = requests.get(url.format(offset=offset)).json()

    # collect the rows and print them to the screen:
    for row in data.get('rows', []):
        all_data.append(row)
        for k, v in row.items():
            print('{:<30} {}'.format(k, v))
        print('-' * 80)

    # fewer than 1000 rows returned means we've reached the last page
    if len(data.get('rows', [])) != 1000:
        break

    offset += 1000

# write everything that was collected to a CSV file
df = pd.DataFrame(all_data)
df.to_csv('data.csv')

Produces a data.csv file containing all of the rows.


Comments

Thank you so much bro. But is there any way that I can extract more than 1000 rows?
@UmairAmir Running this script will load 1000 rows each iteration, until all rows are read (look at the offset variable)
Yes! It is working. Thank you. Is there a way for me to paste all this data in Excel?
@UmairAmir Just open the produced csv file in Excel (or see the sketch after these comments for writing an .xlsx file directly).
Thank you so much bro. You can't imagine how much this means to me.
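If you would rather produce an Excel file directly instead of opening the CSV, here is a minimal sketch, assuming the openpyxl package is installed. all_data is the list of row dicts collected by the script above; a tiny stand-in list is used here so the snippet runs on its own:

import pandas as pd

# stand-in for the rows collected by the download loop above
all_data = [{'importer_name': 'EXAMPLE IMPORTER', 'gross_weight': '19.65'}]

df = pd.DataFrame(all_data)
df.to_excel('data.xlsx', index=False)  # requires openpyxl to be installed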