Extract data from a site using Python

Question

I am making a program that will extract the data from http://www.gujarat.ngosindia.com/

I wrote the following code:

def split_line(text):

    words = text.split()
    i = 0
    details = ''
    while ((words[i] !='Contact')) and (i<len(words)):
        i=i+1
        if(words[i] == 'Contact:'):
            break
    while ((words[i] !='Purpose')) and (i<len(words)):
        if (words[i] == 'Purpose:'):
            break
        details = details+words[i]+' '
        i=i+1
    print(details)

def get_ngo_detail(ngo_url):
    html = urlopen(ngo_url).read()
    soup = BeautifulSoup(html)
    table = soup.find('table', {'class': 'border3'})
    td = soup.find('td', {'class': 'border'})
    split_line(td.text)

def get_ngo_names(gujrat_url):
    html = urlopen(gujrat_url).read()
    soup = BeautifulSoup(html)

    for link in soup.findAll('div',{'id':'mainbox'}):
        for text in link.find_all('a'):
            print(text.get_text())
            ngo_link = 'http://www.gujarat.ngosindia.com/'+text.get('href')
            get_ngo_detail(ngo_link)
            #NGO_name = text2.get_text())

a = get_ngo_names(BASE_URL)

print a

But when I run this script I only get the names of the NGOs and the contact person. I want the email, telephone number, website, purpose and contact person.

As a first step towards finding a solution, try throwing in a couple of print() calls to verify that the data is correct/what you expect in all instances...
– Fredrik Pihl, Commented Jan 28, 2014 at 11:36

jonrsharpe · Accepted Answer · 2014-01-28 11:49:09Z

1

Your split_line could be improved. Imagine you have this text:

s = """Add: 3rd Floor Khemha House
Drive in Road, Opp Drive in Cinema
Ahmedabad - 380 054
Gujarat
Tel: 91-79-7457611 , 79-7450378
Email: [email protected]
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises
Aim/Objective/Mission: To provide timely financing, management support and professional expertise ..."""

Now we can turn this into lines using s.split("\n") (split on each new line), giving a list where each item is a line:

lines = s.split("\n")
lines == ['Add: 3rd Floor Khemha House', 
          'Drive in Road, Opp Drive in Cinema', 
          ...]

We can define a list of the elements we want to extract, and a dictionary to hold the results:

targets = ["Contact", "Purpose", "Email"]
results = {}

And work through each line, capturing the information we want:

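for line in lines:
    # partition splits at the first colon only, so a value may itself contain colons
    key, sep, value = line.partition(":")
    if sep and key in targets:
        results[key] = value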

This gives me:

results == {'Contact': ' Angha Mitra', 
            'Purpose': ' Economics and Finance, Micro-enterprises', 
            'Email': ' [email protected]'}
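
The same loop can be extended to the other fields the question asks for. What follows is only a minimal sketch, assuming each field sits on its own line in the form "Label: value", as in the sample text above. The helper name parse_fields and the shortened sample are illustrative and not part of the original answer. It splits on the first colon only, so the colon inside the website URL stays in the value, and it strips the leading space from each value.

```python
def parse_fields(text, targets=("Tel", "Email", "Website", "Contact", "Purpose")):
    """Return {label: value} for every target label found in text."""
    results = {}
    for line in text.splitlines():
        # partition splits on the first colon only, so "http://..." survives intact
        label, sep, value = line.partition(":")
        label = label.strip()
        if sep and label in targets:
            results[label] = value.strip()
    return results


# Shortened, illustrative sample based on the text in the answer
sample = """Tel: 91-79-7457611 , 79-7450378
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises"""

fields = parse_fields(sample)
print(fields["Website"])
```

Labels that are missing from a page are simply absent from the returned dictionary, so fields.get("Email", "") is a safe way to read a field that may not be there.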


4 Comments

vivek27 Over a year ago

Using your code for the entire list on the website, I get the details of the first 6 NGOs, but after them I get blank results.

jonrsharpe Over a year ago

Some of the pages have missing data, e.g. gujarat.ngosindia.com/… doesn't have "Website". This is to be expected.

vivek27 Over a year ago

I know some of them are missing some data, but why is the other data, which is present, not coming through?

jonrsharpe Over a year ago

I don't know, you will have to put some prints in to find out what you're getting and how to process it.
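
One way to act on the suggestion to add prints is to show every line with repr() and record which labels were never matched. This is a hedged sketch only: the thread does not establish why the later pages come back blank, the helper name report_lines is hypothetical, and the sample text is made up to show two labels that would not match.

```python
def report_lines(text, targets=("Tel", "Email", "Website", "Contact", "Purpose")):
    """Print each line with repr() and return the target labels never matched."""
    seen = set()
    for number, line in enumerate(text.splitlines(), start=1):
        label = line.partition(":")[0].strip()
        matched = label in targets
        if matched:
            seen.add(label)
        # repr() exposes tabs, stray spaces and other characters print() would hide
        print(number, "MATCH" if matched else "     ", repr(line))
    return [t for t in targets if t not in seen]


# Made-up text: a differently worded label and a line with no colon at all
page_text = "Tel: 123\nContact Person: Someone\nPurpose - Education"
missing = report_lines(page_text)
print("never matched:", missing)
```

Running this against td.text for one of the pages that gives blank results should show whether the labels are worded differently, whether the fields are not on separate lines, or whether the expected td was not found in the first place.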

in need of help · Accepted Answer · 2014-01-28 11:42:10Z

0

Try splitting the contents of the NGO page more carefully. You can split on a regular expression (in Python that means re.split, since str.split only takes a literal separator), e.g. "Add:|Tel:|Email:|Website:|Contact:|Purpose:".

My regular expression could be wrong but this is the direction you should head in.
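
A minimal sketch of that direction, assuming the labels are the ones visible in the sample text of the other answer (Add, Tel, Email, Website, Contact, Purpose). The names LABELS, PATTERN and split_by_labels are illustrative, and the sample string is a shortened, flattened version of that text. Wrapping the alternatives in a capturing group makes re.split keep each matched label, so labels and values can be paired up afterwards.

```python
import re

LABELS = ("Add", "Tel", "Email", "Website", "Contact", "Purpose")
# The capturing group makes re.split return the matched labels as well
PATTERN = re.compile(r"\b(" + "|".join(LABELS) + r"):")


def split_by_labels(text):
    """Split text on the known labels and pair each label with what follows it."""
    parts = PATTERN.split(text)
    # parts is [text before the first label, label1, value1, label2, value2, ...]
    return {label: " ".join(value.split())
            for label, value in zip(parts[1::2], parts[2::2])}


# Illustrative sample, flattened onto one line as text.split() would leave it
sample = ("Add: 3rd Floor Khemha House Ahmedabad "
          "Tel: 91-79-7457611 Website: http://www.aavishkaar.org "
          "Contact: Angha Mitra Purpose: Economics and Finance")
fields = split_by_labels(sample)
print(fields["Contact"])
```

Because the split is driven by the labels rather than by newlines, this variant also works when the page text has already been collapsed into a single line. It will misfire if a value happens to contain one of the labels followed by a colon.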
