How to write csv file from scraped data from web in python

Question

I am trying to scrape data from web pages, and the scraping itself works. With the script below I get the text of all the matching div elements, but I am confused about how to write that data to a CSV file, like this:

First name data in the First Name column, last name data in the Last Name column, and so on.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

data = BeautifulSoup(page, 'html.parser')

name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'}) #edited companyName_99a4824b -> companyName__99a4824b

for i in range(len(name_box)):
    data = name_box[i].text.strip()

Data:

Information Type
Individual

First Name
KACHAM
Middle Name

Last Name
RAJESHWAR
Father Full Name
RAMAIAH
Do you have any Past Experience ?
No
Do you have any registration in other State than registred State?
No
House Number
8-2-293/82/A/446/1
Building Name
SAI KRUPA
Street  Name
ROAD NO 20
Locality
JUBILEE HILLS
Landmark
JUBILEE HILLS
State
Telangana
Division
Division 1
District
Hyderabad
Mandal
Shaikpet
Village/City/Town

Pin Code
500033
Office Number
04040151614
Fax Number

Website URL

Authority Name

Plan Approval Number
1/18B/06558/2018
Project Name
SKV S ANANDA VILAS
Project Status
New Project
Proposed Date of Completion
17/04/2024
Litigations related to the project ?
No
Project Type
Residential
Are there any Promoter(Land Owner/ Investor) (as defined by Telangana RERA Order) in the project ?
Yes
Sy.No/TS No.
00
Plot No./House No.
10-2-327
Total Area(In sqmts)
526.74
Area affected in Road widening/FTL of Tanks/Nala Widening(In sqmts)
58.51
Net Area(In sqmts)
1
Total Building Units (as per approved plan)
1
Proposed Building Units(as per agreement)
1


Boundaries East
PLOT NO 213
Boundaries West
PLOT NO 215
Boundaries North
PLOT NO 199
Boundaries South
ROAD NO 8
Approved Built up Area (In Sqmts)
1313.55
Mortgage Area  (In Sqmts)
144.28
State
Telangana
District
Hyderabad
Mandal
Maredpally
Village/City/Town

Street
ROAD NO 8
Locality
SECUNDERABAD COURT
Pin Code
500026

Above is the data I get after running the code.

Edit

for i in range(len(name_box)):
    data = name_box[i].text.strip()
    print (data)
    fname = 'out.csv'
    with open(fname) as f:
        next(f)
        for line in f:
            head = []
            value = []
            for row in line:
                head.append(row)
            print (row)

Expected

Information Type | First  | Middle Name | Last Name | ......
Individual       | KACHAM |             | RAJESHWAR | .....

I have 200 URLs, but the data is not the same for all of them: some fields are missing on some pages. If a field is not available, I want to write nothing for it, just a blank.

Please suggest. Thank you in advance

Have you considered using pandas? The built-in functions pd.read_html() for scraping web content and df.to_csv() for writing CSV files may make your code more readable.
– moo, Commented Nov 4, 2018 at 7:27
Yes. It's not writing because, before writing, I want to arrange the data as in the expected result. Please help.
– user10468005, Commented Nov 4, 2018 at 7:35
@Mark, no. How would I do it in pandas? How do I write the row data under column names, as in the expected result?
– user10468005, Commented Nov 4, 2018 at 7:38
From the pandas module: pandas.read_html() will parse the HTML tables into a DataFrame. df.to_csv() will convert the DataFrame into CSV. The pandas module documentation is excellent, available at pandas.pydata.org/pandas-docs/stable/generated/…
– moo, Commented Nov 4, 2018 at 18:38

ewwink · Accepted Answer · 2018-11-04 13:17:22Z

1

To write to CSV you need to know which values belong in the header row and which in the body. In this case, the header values are the elements that contain a <label> tag.

from urllib.request import urlopen  # Python 3, as in the question (Python 2: urllib2)
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

soup = BeautifulSoup(page, 'html.parser')

name_box = soup.find_all('div', attrs={'class': 'col-md-3 col-sm-3'})

heads = []   # header row (labels)
values = []  # second row (values)

for box in name_box:
    data = box.text.strip()
    dataHTML = str(box)
    if 'PInfoType' in dataHTML:
        # <div class="col-md-3 col-sm-3" id="PInfoType">
        # empty value, maybe additional data for "Information Type"
        continue

    if 'for="2"' in dataHTML:
        # <label for="2">No</label>
        # contains a <label>, so it looks like a head, but it is actually a value
        values.append(data)

    elif '<label' in dataHTML:
        # <label for="PersonalInfoModel_InfoTypeValue">Information Type</label>
        # head or top row
        heads.append(data)

    else:
        # <div class="col-md-3 col-sm-3">Individual</div>
        # value for second row
        values.append(data)

csvData = ', '.join(heads) + '\n' + ', '.join(values)
with open("results.csv", 'w') as f:
    f.write(csvData)

print("finish.")
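
One caveat with the plain ', '.join(...) above: a value that itself contains a comma (an address, for example) would be split across two columns when the file is opened. A minimal sketch of one way around this, using the standard csv module, which quotes such values automatically. The helper name write_rows and the two sample lists are made up for illustration; in the real script they would be the heads and values lists built in the loop.

```python
import csv
import io

def write_rows(stream, heads, values):
    """Write one header row and one value row, quoting fields where needed."""
    writer = csv.writer(stream)
    writer.writerow(heads)
    writer.writerow(values)

# Illustrative sample data, not scraped from the page.
heads = ["First Name", "Plot No./House No."]
values = ["KACHAM", "Plot No. 46 & 47, 2-2-647/8"]

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_rows(buffer, heads, values)
csv_text = buffer.getvalue()
print(csv_text)

# To write a real file, open it with newline="" as the csv docs recommend:
# with open("results.csv", "w", newline="") as f:
#     write_rows(f, heads, values)
```

The value containing a comma is wrapped in double quotes in the output, so spreadsheet tools and csv.reader treat it as a single field.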

edited Nov 4, 2018 at 13:17

answered Nov 4, 2018 at 8:40


6 Comments

user10468005 Over a year ago

Thank you, Can you please explain these lines:

if 'PInfoType' in dataHTML:
    continue
if 'for="2"' in dataHTML:
    values.append(data)
elif '<label' in dataHTML:
    heads.append(data)

user10468005 Over a year ago

I am trying this for multiple URLs, but I am not getting all the same columns as for the first link. Second link:

http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTE1NiZEaXZpc2lvbj0xJlVzZXJJRD0yMTA2NCZSb2xlSUQ9MSZBcHBJRD0yNDcmQWN0aW9uPVNFQVJDSCZDaGFyYWN0ZXJEPTQ3JkV4dEFwcElEPQ%3d%3d

ewwink Over a year ago

Code updated with explanation. For the second link, just add the needed condition: if... elif...

user10468005 Over a year ago

Thanks. How do I loop for this link?

http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTE1NiZEaXZpc2lvbj0xJlVzZXJJRD0yMTA2NCZSb2xlSUQ9MSZBcHBJRD0yNDcmQWN0aW9uPVNFQVJDSCZDaGFyYWN0ZXJEPTQ3JkV4dEFwcElEPQ%3d%3d

Also, the value Plot No. 46 & 47, 2-2-647/8 & 2-2-647/8/2 ends up in two columns, but I want it in one.

user10468005 Over a year ago

Please also suggest how to handle the table data on the same page. Can it be scraped into the same CSV using the same code?
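
The points raised in these comments (several URLs, fields missing on some pages) could be handled by collecting one dict per page and letting csv.DictWriter fill the gaps. A minimal sketch, assuming each page has already been reduced to a {label: value} dict; the scraping step is left out, the helper name write_records is hypothetical, and the second sample record is invented for illustration.

```python
import csv
import io

def write_records(stream, records):
    """Write dicts with differing keys to CSV; missing fields become blank."""
    # Union of all labels in first-seen order, so every page shares one header.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval="" is what DictWriter writes for keys a record does not have.
    writer = csv.DictWriter(stream, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return fieldnames

# One dict per scraped page; the second record is made up and lacks some fields.
records = [
    {"First Name": "KACHAM", "Last Name": "RAJESHWAR", "Pin Code": "500033"},
    {"First Name": "EXAMPLE", "Plot No./House No.": "Plot No. 46 & 47, 2-2-647/8"},
]

buffer = io.StringIO()
fieldnames = write_records(buffer, records)
multi_csv = buffer.getvalue()
print(multi_csv)
```

In a real run the records list would be filled inside a loop over the URLs, and the buffer replaced by a file opened with newline="".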


stovfl · Accepted Answer · 2018-11-04 10:01:41Z

0

Question: How to write csv file from scraped data

Read the data into a dict and use csv.DictWriter(... to write it to a CSV file.
Documentation about: csv.DictWriter, while, next, break, and the dict mapping type.

Skip the first line, as it's the title
Loop Data lines
1. key = next(data)
2. value = next(data)
3. Break loop if no further data
4. Build dict[key] = value
After finishing the loop, write dict to CSV file

Output:

{'Individual': '', 'Father Full Name': 'RAMAIAH', 'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR',... (omitted for brevity)
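
The steps above can be sketched as follows. This is only one possible reading of them, assuming the scraped text is available as a flat list of lines in which labels and values alternate after the title; the function name pairs_to_dict is made up, and the sample lines are the first few from the question's printed data.

```python
import csv
import io

def pairs_to_dict(lines):
    """Skip the title line, then read alternating key/value lines into a dict."""
    data = iter(lines)
    next(data, None)              # skip the first line, as it's the title
    record = {}
    while True:
        key = next(data, None)
        value = next(data, None)
        if key is None or value is None:
            break                 # no further data (or a dangling key)
        record[key] = value
    return record

# First few lines of the printed data from the question.
lines = ["Information Type", "Individual", "",
         "First Name", "KACHAM", "Middle Name", "", "Last Name", "RAJESHWAR"]
record = pairs_to_dict(lines)

# After the loop, write the dict as one header row and one value row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```

Because the first line is skipped, "Individual" becomes a key with an empty value here, which matches the output shown above.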

edited Nov 4, 2018 at 10:01

answered Nov 3, 2018 at 9:42


2 Comments

user10468005 Over a year ago

I have tried, but I am not getting the expected result. Can you please share the code?

stovfl Over a year ago

@user10468005: Edit your question to show your efforts and explain where you got stuck.
