
I am writing a program to extract data from http://www.gujarat.ngosindia.com/

I wrote the following code:

# Python 3
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumed value: the original post does not show how BASE_URL is defined.
BASE_URL = 'http://www.gujarat.ngosindia.com/'

def split_line(text):
    # Print the words from 'Contact:' up to (not including) 'Purpose:'.
    words = text.split()
    i = 0
    details = ''
    # Test the bound first so words[i] is never read past the end of the list.
    while i < len(words) and words[i] != 'Contact:':
        i = i + 1
    while i < len(words) and words[i] != 'Purpose:':
        details = details + words[i] + ' '
        i = i + 1
    print(details)

def get_ngo_detail(ngo_url):
    html = urlopen(ngo_url).read()
    soup = BeautifulSoup(html, 'html.parser')
    td = soup.find('td', {'class': 'border'})
    split_line(td.text)

def get_ngo_names(gujrat_url):
    html = urlopen(gujrat_url).read()
    soup = BeautifulSoup(html, 'html.parser')

    for box in soup.find_all('div', {'id': 'mainbox'}):
        for link in box.find_all('a'):
            print(link.get_text())
            ngo_link = 'http://www.gujarat.ngosindia.com/' + link.get('href')
            get_ngo_detail(ngo_link)

# get_ngo_names prints as it goes and returns nothing, so there is no value to print.
get_ngo_names(BASE_URL)

But when I run this script, I only get the name of each NGO and its contact person. I want the email, telephone number, website, purpose and contact person.

  • As a first step towards finding a solution, try adding a couple of print() calls to verify that the data is what you expect in every case. Commented Jan 28, 2014 at 11:36
  • Or use pdb to step through the code. Commented Jan 28, 2014 at 15:21

2 Answers


Your split_line could be improved. Imagine you have this text:

s = """Add: 3rd Floor Khemha House
Drive in Road, Opp Drive in Cinema
Ahmedabad - 380 054
Gujarat
Tel: 91-79-7457611 , 79-7450378
Email: [email protected]
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises
Aim/Objective/Mission: To provide timely financing, management support and professional expertise ..."""

Now we can turn this into lines using s.split("\n") (split on each new line), giving a list where each item is a line:

lines = s.split("\n")
lines == ['Add: 3rd Floor Khemha House', 
          'Drive in Road, Opp Drive in Cinema', 
          ...]

We can define a list of the elements we want to extract, and a dictionary to hold the results:

targets = ["Contact", "Purpose", "Email"]
results = {}

And work through each line, capturing the information we want:

for line in lines:
    # Split on the first colon only, so values such as "http://..." stay intact.
    parts = line.split(":", 1)
    if parts[0] in targets:
        results[parts[0]] = parts[1]

This gives me:

results == {'Contact': ' Angha Mitra', 
            'Purpose': ' Economics and Finance, Micro-enterprises', 
            'Email': ' [email protected]'}
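
To cover every field asked for in the question, the same idea can be wrapped in a function. This is only a sketch: the name parse_details and the TARGETS tuple are made up for illustration, and it assumes each field sits on its own line with a "Label: value" layout, as in the sample text above.

```python
# Hypothetical helper: labels are assumed to match the sample text exactly.
TARGETS = ("Tel", "Email", "Website", "Contact", "Purpose")

def parse_details(text, targets=TARGETS):
    """Return a dict mapping each target label found in text to its value."""
    results = {}
    for line in text.splitlines():
        # partition splits on the first colon only and never raises,
        # so URLs keep their "http://" and colon-free lines are skipped.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in targets:
            results[key] = value.strip()
    return results

sample = """Add: 3rd Floor Khemha House
Ahmedabad - 380 054
Tel: 91-79-7457611 , 79-7450378
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises"""

details = parse_details(sample)
for label in TARGETS:
    # .get() returns None for a field the page does not have.
    print(label, "->", details.get(label))
```

Using .get() when reading the results means a page with a missing field (for example no "Website") yields None rather than a KeyError.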

4 Comments

Using your code on the entire list from the website, I get the details of the first 6 NGOs, but after those I get blank results.
Some of the pages have missing data, e.g. gujarat.ngosindia.com/… doesn't have "Website". This is to be expected.
I know some of them are missing some data, but why is the other data that is present not being returned?
I don't know, you will have to put some prints in to find out what you're getting and how to process it.
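
One way to "put some prints in" is to summarize the text of each page before parsing it. The helper below is a hypothetical sketch (summarize_text is not part of any library), and the label list is an assumption taken from the sample text in the answer.

```python
def summarize_text(text, labels=("Tel", "Email", "Website", "Contact", "Purpose")):
    """Report how many lines the text has and which field labels it contains."""
    lines = text.split("\n")
    found = [label for label in labels if label + ":" in text]
    return {
        "line_count": len(lines),
        "labels_found": found,
        # repr() makes whitespace such as \n, \r and \xa0 visible.
        "preview": repr(text[:80]),
    }

# Example with text that has no newlines between the fields.
summary = summarize_text("Tel: 123 Email: someone Contact: A Person")
print(summary)
```

Calling it as print(summarize_text(td.text)) inside get_ngo_detail shows what each page actually returns. If a page reports several labels but a line_count of 1, the fields are not separated by newlines on that page, and a line-based split will find nothing.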
0

Try splitting the contents of the NGO page more carefully. The str.split method only accepts a literal separator, but re.split accepts a regular expression, so you can split on the field labels, e.g. "(Tel|Email|Website|Contact|Purpose):"

My regular expression could be wrong but this is the direction you should head in.
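
A minimal sketch of that direction, using re.split with a capturing group so the labels are kept alongside the values. The function name split_fields, the label list and the sample string (including the placeholder address) are assumptions for illustration, not taken from the site.

```python
import re

# Assumed labels; adjust to whatever the pages actually use.
FIELD_LABELS = ["Tel", "Email", "Website", "Contact", "Purpose"]

def split_fields(text, labels=FIELD_LABELS):
    """Split text on 'Label:' markers and return a {label: value} dict."""
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r"):"
    # With a capturing group, re.split returns
    # [text_before, label1, value1, label2, value2, ...].
    parts = re.split(pattern, text)
    return {label: value.strip() for label, value in zip(parts[1::2], parts[2::2])}

text = ("Add: 3rd Floor Khemha House Tel: 91-79-7457611 "
        "Email: info@example.org Website: http://www.aavishkaar.org "
        "Contact: Angha Mitra Purpose: Economics and Finance")
fields = split_fields(text)
print(fields)
```

Unlike splitting on newlines, this works even when all fields arrive on a single line. The last value runs to the end of the text, so any trailing section with a label outside the list ends up inside it.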
