
I need to extract a table from the website "https://geniusimpex.org/pakistan-import-data/", which has thousands of rows, so I wanted to automate the process using bs4 and selenium. But when I extract the table, only the table header is extracted. This is the code I used:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://geniusimpex.org/pakistan-import-data/"

# fetch the raw HTML and parse it
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

type(soup)         # quick check of the parsed object in the notebook
soup.prettify()    # and a look at the parsed HTML

# only the header <tr> is returned here
print(soup.find_all('tr'))

It shows the following output (screenshot: https://i.sstatic.net/GItzv.png).

As you can see, only the first row is extracted. Can someone please tell me why I am not able to extract the table and how I can do so? It would be very helpful. Sorry if I am not clear or couldn't explain my problem; this is the first time I am asking a question on Stack Overflow.

  • This could be all manner of things. Help us to help you by posting the relevant code to your problem. Commented Aug 17, 2020 at 11:39
  • My suspicion is that you're using bs4 to try and extract the table. If the table is rendered by JavaScript then that won't be possible, since the data won't have had time to load. Try using selenium to extract this part of the table (see the sketch after these comments). Commented Aug 17, 2020 at 11:42
  • I think you are right but I also used selenium and the result was the same. It could not detect any other row except for the first one. Commented Aug 17, 2020 at 11:59
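For reference, if the rows really are injected by JavaScript as the second comment suggests, urlopen only ever sees the initial HTML, so Selenium has to wait for the rows to render before reading them. A minimal sketch, assuming the data lands in an ordinary table body; the CSS selector "table tbody tr" and the 20-second timeout are guesses about the page's markup, not verified against the site:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://geniusimpex.org/pakistan-import-data/")

    # wait until at least one data row (beyond the header) has been rendered
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
    )

    # read whatever rows are currently in the DOM
    rows = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")
    for row in rows:
        print(row.text)
finally:
    driver.quit()

As the accepted answer below shows, hitting the JSON endpoint directly avoids the browser entirely and is simpler for thousands of rows.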

1 Answer


The data is loaded as JSON from an external URL. You can use this script to load the information:

import requests


# the page loads its table from this admin-ajax endpoint, 1000 rows per request
url = 'https://geniusimpex.org/wp-admin/admin-ajax.php?action=ge_forecast_list_data&order=asc&offset={offset}&limit=1000'

offset = 0
while True:
    data = requests.get(url.format(offset=offset)).json()

    # print data to screen:
    for row in data.get('rows', []):
        for k, v in row.items():
            print('{:<30} {}'.format(k, v))
        print('-' * 80)

    # fewer than 1000 rows returned means we've reached the last page
    if len(data.get('rows', [])) != 1000:
        break

    offset += 1000

Prints:

...

--------------------------------------------------------------------------------
count                          T
importer_name                  <span file_id="27893" post_count="T" post_id="2157293">BISMILLAH STEEL FURNACE \n NEAR GRID STATION DEEWAN</span>
goods_description              IRON AND STEEL REMELTABLE SCRAP HARMONIZED CODE: 7204.4990 REFERENCE NUMBER:UM/PAK/5146A ITN: X20200629019843 NWT WEIGHT-19.650 MT SHIPPERS LOAD, STOWAGE AND COUNT
hs_code                        
shipment_port                   NEWARK  APT/NEW 
gross_weight                    19.65 
number_of_packages              1 
unit_of_packages                PACKAGES 
size_of_container               1 X 20FT 
imported_from_name             SEALINK INTERNATIONAL INC C/O\n UNIVERSAL METALS, ,
bill_of_lading_number           SII145321 
bill_of_lading_date            <span data="10-08-2020">10-08-2020</span>
--------------------------------------------------------------------------------
count                          T
importer_name                  <span file_id="27938" post_count="T" post_id="2159597">ASAD SHAHZAD S/O FAQIR ZADA</span>
goods_description              1 USED VEHICLE TOYOTA VITZ CHASSIS NO: KSP130 -2204837
hs_code                        NA
shipment_port                   NAGOYA,  AICHI 
gross_weight                    .97 
number_of_packages              1 
unit_of_packages                UNIT 
size_of_container               1 X 40FT 
imported_from_name             KASHMIR MOTORS , 3055-9-104 KUZUTSUKA NIIGATA KITA
bill_of_lading_number           TA200716H06- 10 
bill_of_lading_date            <span data="10-08-2020">10-08-2020</span>
--------------------------------------------------------------------------------


...
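Note that some fields in the JSON (importer_name and bill_of_lading_date above) still carry HTML <span> wrappers. A minimal clean-up sketch, assuming you want plain text; this step is not part of the original answer, and the sample row below is made up to mirror the output above:

from bs4 import BeautifulSoup

def clean(value):
    # run string fields through BeautifulSoup to drop any HTML tags
    if isinstance(value, str):
        return BeautifulSoup(value, 'lxml').get_text(strip=True)
    return value

# stand-in row mimicking the printed output
row = {
    'importer_name': '<span file_id="27893">BISMILLAH STEEL FURNACE</span>',
    'bill_of_lading_date': '<span data="10-08-2020">10-08-2020</span>',
}

cleaned = {k: clean(v) for k, v in row.items()}
print(cleaned)

You could apply the same clean() to each row before printing or saving it.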

EDIT: To save to CSV, you can use this script:

import requests
import pandas as pd


url = 'https://geniusimpex.org/wp-admin/admin-ajax.php?action=ge_forecast_list_data&order=asc&offset={offset}&limit=1000'

offset = 0
all_data = []
while True:
    data = requests.get(url.format(offset=offset)).json()

    # collect the rows and print them to the screen:
    for row in data.get('rows', []):
        all_data.append(row)
        for k, v in row.items():
            print('{:<30} {}'.format(k, v))
        print('-' * 80)

    # fewer than 1000 rows returned means we've reached the last page
    if len(data.get('rows', [])) != 1000:
        break

    offset += 1000

# write everything that was collected to a CSV file
df = pd.DataFrame(all_data)
df.to_csv('data.csv')

Produces a data.csv file containing all of the rows.


Comments

Thank you so much bro. But is there any way that I can extract more than 1000 rows?
@UmairAmir Running this script will load 1000 rows each iteration, until all rows are read (look at the offset variable)
Yes! It is working. Thank you. Is there a way for me to paste all this data in Excel?
@UmairAmir Just open the produced csv file in Excel (or see the sketch after these comments for writing an .xlsx file directly).
Thank you so much bro. You can't imagine how much this means to me.
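If you would rather produce an Excel file directly instead of opening the CSV, here is a minimal sketch, assuming the openpyxl package is installed. all_data is the list of row dicts collected by the script above; a tiny stand-in list is used here so the snippet runs on its own:

import pandas as pd

# stand-in for the rows collected by the download loop above
all_data = [{'importer_name': 'EXAMPLE IMPORTER', 'gross_weight': '19.65'}]

df = pd.DataFrame(all_data)
df.to_excel('data.xlsx', index=False)  # requires openpyxl to be installed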