I am trying to scrape data from web pages, and the scraping itself works. The script below returns the text of every matching div, but I am not sure how to write that data to a CSV file so that the first-name value goes in the First Name column, the last-name value goes in the Last Name column, and so on.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

data = BeautifulSoup(page, 'html.parser')

name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'}) #edited companyName_99a4824b -> companyName__99a4824b

for i in range(len(name_box)):
    data = name_box[i].text.strip()

Data:

Information Type
Individual

First Name
KACHAM
Middle Name

Last Name
RAJESHWAR
Father Full Name
RAMAIAH
Do you have any Past Experience ?
No
Do you have any registration in other State than registred State?
No
House Number
8-2-293/82/A/446/1
Building Name
SAI KRUPA
Street  Name
ROAD NO 20
Locality
JUBILEE HILLS
Landmark
JUBILEE HILLS
State
Telangana
Division
Division 1
District
Hyderabad
Mandal
Shaikpet
Village/City/Town

Pin Code
500033
Office Number
04040151614
Fax Number

Website URL

Authority Name

Plan Approval Number
1/18B/06558/2018
Project Name
SKV S ANANDA VILAS
Project Status
New Project
Proposed Date of Completion
17/04/2024
Litigations related to the project ?
No
Project Type
Residential
Are there any Promoter(Land Owner/ Investor) (as defined by Telangana RERA Order) in the project ?
Yes
Sy.No/TS No.
00
Plot No./House No.
10-2-327
Total Area(In sqmts)
526.74
Area affected in Road widening/FTL of Tanks/Nala Widening(In sqmts)
58.51
Net Area(In sqmts)
1
Total Building Units (as per approved plan)
1
Proposed Building Units(as per agreement)
1


Boundaries East
PLOT NO 213
Boundaries West
PLOT NO 215
Boundaries North
PLOT NO 199
Boundaries South
ROAD NO 8
Approved Built up Area (In Sqmts)
1313.55
Mortgage Area  (In Sqmts)
144.28
State
Telangana
District
Hyderabad
Mandal
Maredpally
Village/City/Town

Street
ROAD NO 8
Locality
SECUNDERABAD COURT
Pin Code
500026

This is the output I get when I run the code above.

Edit

for i in range(len(name_box)):
    data = name_box[i].text.strip()
    print (data)
    fname = 'out.csv'
    with open(fname) as f:
        next(f)
        for line in f:
            head = []
            value = []
            for row in line:
                head.append(row)
            print (row)

Expected

Information Type | First  | Middle Name | Last Name | ......
Individual       | KACHAM |             | RAJESHWAR | .....

I have 200 URLs, but the pages do not all contain the same fields; some fields are missing on some pages. When a field is not available, I want to write nothing for it and leave the cell blank.

Please suggest. Thank you in advance

5 Comments

  • Have you considered using pandas? The built-in functions pd.read_html() for scraping web content and df.to_csv() for writing CSV files may make your code more readable. Commented Nov 4, 2018 at 7:27
  • open(fname) only opens the file for reading; it does not write to it. Commented Nov 4, 2018 at 7:27
  • Yes, it is not writing yet because, before writing, I want to arrange the data like the expected result. Please help. Commented Nov 4, 2018 at 7:35
  • @Mark, no. How would I do that in pandas? How do I turn the row data into column names, as in the expected result? Commented Nov 4, 2018 at 7:38
  • From the pandas module: pandas.read_html() will parse the HTML tables into a DataFrame. df.to_csv() will convert the DataFrame into CSV. The pandas module documentation is excellent, available at pandas.pydata.org/pandas-docs/stable/generated/… Commented Nov 4, 2018 at 18:38

2 Answers

To write a CSV file you need to know which values belong in the header row and which in the body. In this case, a header value is any div whose HTML contains a <label tag:

import csv
from urllib.request import urlopen  # Python 3; urllib2 exists only in Python 2
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

soup = BeautifulSoup(page, 'html.parser')

name_box = soup.find_all('div', attrs={'class': 'col-md-3 col-sm-3'})

heads = []
values = []

for box in name_box:
    data = box.text.strip()
    dataHTML = str(box)
    if 'PInfoType' in dataHTML:
        # <div class="col-md-3 col-sm-3" id="PInfoType">
        # empty value, maybe additional data for "Information Type"
        continue

    if 'for="2"' in dataHTML:
        # <label for="2">No</label>
        # looks like a head because of the label, but it is actually a value
        values.append(data)

    elif '<label' in dataHTML:
        # <label for="PersonalInfoModel_InfoTypeValue">Information Type</label>
        # head or top row
        heads.append(data)

    else:
        # <div class="col-md-3 col-sm-3">Individual</div>
        # value for second row
        values.append(data)

# csv.writer quotes any value that contains a comma, so it stays in one column
with open("results.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(heads)
    writer.writerow(values)

print("finish.")
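
For many URLs where some fields are missing, one possible extension is to turn each page's heads and values lists into a dict and let csv.DictWriter fill the gaps. This is only a sketch under assumptions: make_record and write_records are hypothetical helper names, the sample lists stand in for what the loop above would collect per page, and repeated labels such as State are given a numeric suffix so they do not overwrite each other.

```python
import csv
import io


def make_record(heads, values):
    """Pair each label with its value; repeated labels get a numeric suffix."""
    record = {}
    for head, value in zip(heads, values):
        key, n = head, 1
        while key in record:
            n += 1
            key = f"{head} ({n})"
        record[key] = value
    return record


def write_records(records, fileobj):
    """Write one row per page; fields a page lacks are left blank."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)


# Hypothetical heads/values for two pages with different fields
page_one = make_record(
    ["First Name", "Last Name", "State", "State"],
    ["KACHAM", "RAJESHWAR", "Telangana", "Telangana"],
)
page_two = make_record(
    ["First Name", "Plot No./House No."],
    ["EXAMPLE", "Plot No. 46 & 47, 2-2-647/8"],
)

out = io.StringIO()
write_records([page_one, page_two], out)
print(out.getvalue())
```

To write a real file instead of the in-memory buffer, pass a file opened with open("results.csv", "w", newline="") to write_records.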

6 Comments

Thank you. Can you please explain these lines: if 'PInfoType' in dataHTML: continue if 'for="2"' in dataHTML: values.append(data) elif '<label' in dataHTML: heads.append(data)
I am trying this on multiple URLs, but I do not get the same columns as for the first link. Second link: http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTE1NiZEaXZpc2lvbj0xJlVzZXJJRD0yMTA2NCZSb2xlSUQ9MSZBcHBJRD0yNDcmQWN0aW9uPVNFQVJDSCZDaGFyYWN0ZXJEPTQ3JkV4dEFwcElEPQ%3d%3d
Code updated with an explanation. For the second link, just add the needed if ... elif ... conditions.
Thanks. How do I loop over this link: http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTE1NiZEaXZpc2lvbj0xJlVzZXJJRD0yMTA2NCZSb2xlSUQ9MSZBcHBJRD0yNDcmQWN0aW9uPVNFQVJDSCZDaGFyYWN0ZXJEPTQ3JkV4dEFwcElEPQ%3d%3d ? Also, the value Plot No. 46 & 47, 2-2-647/8 & 2-2-647/8/2 ends up in two columns, but I want it in one.
Please also suggest how to handle the table data on the same link. Can we scrape it into the same CSV using the same code?

Question: How to write csv file from scraped data

Read the data into a dict and use csv.DictWriter to write it to a CSV file.
Documentation: csv.DictWriter, while, next, break, Mapping Types — dict

  1. Skip the first line, as it's the title
  2. Loop Data lines
    1. key = next(data)
    2. value = next(data)
    3. Break loop if no further data
    4. Build dict[key] = value
  3. After finishing the loop, write dict to CSV file

Output:

{'Individual': '', 'Father Full Name': 'RAMAIAH', 'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR',... (omitted for brevity)
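
The steps above can be sketched roughly as follows. This is an illustration, not the answerer's actual code: lines_to_dict and write_record are hypothetical names, the sample list is a shortened stand-in for the scraped output, and it assumes the first line is a title (here "Data:") followed by strictly alternating label and value lines.

```python
import csv
import io


def lines_to_dict(lines):
    """Pair alternating label/value lines into a dict, skipping the title line."""
    data = iter(lines)
    next(data, None)  # step 1: skip the title line
    record = {}
    while True:  # step 2: loop over the data lines
        try:
            key = next(data)
            value = next(data)
        except StopIteration:
            break  # no further data; a trailing label without a value is dropped
        record[key.strip()] = value.strip()
    return record


def write_record(record, fileobj):
    """Step 3: write the dict as a header row plus one data row."""
    writer = csv.DictWriter(fileobj, fieldnames=list(record))
    writer.writeheader()
    writer.writerow(record)


# Shortened, hypothetical sample shaped like the scraped output
sample = [
    "Data:",
    "Information Type", "Individual",
    "First Name", "KACHAM",
    "Middle Name", "",
    "Last Name", "RAJESHWAR",
]

record = lines_to_dict(sample)
buffer = io.StringIO()
write_record(record, buffer)
print(buffer.getvalue())
```

The pairing only holds while labels and values strictly alternate; an extra blank line in the scraped text would shift every later pair, so the input may need cleaning first.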

2 Comments

I have tried, but I am not getting the expected result. Can you please share the code?
@user10468005: Edit your question to show your efforts and explain where you get stuck.
