
I'm trying to scrape some web content and I'm having a hard time formatting the output. My code builds a list, then iterates over it to add more information. I'm getting all of the data I need, but when I save it to a CSV I end up with more than one list per line. I'm not sure how to fix this.

Here's my code:

# Imports this snippet relies on; lxml is assumed for etree/xpath,
# and this is Python 2 code (urllib.urlopen, izip_longest).
import os
import csv
import itertools
import urllib
from StringIO import StringIO
from lxml import etree

parser = etree.HTMLParser()  # parser passed to etree.parse() below

def getPeople(company, url, filename):
  persList = []
  category = os.path.splitext(filename)[0]
  code = urllib.urlopen(url)
  html = code.read()
  unic = unicode(html, errors='ignore')
  tree = etree.parse(StringIO(unic), parser)
  personNames = tree.xpath('//a[@class="person"]/text()')
  personUrls = tree.xpath('//a[@class="person"]/@href')
  for i, j in zip(personNames, personUrls):
    personInfo = (company, category, i, j)
    internal = list(personInfo)
    persList.append(internal)
  result = list(persList)
  return result

def tasker(filename):
  peopleList = []
  companyNames = getCompanies(filename, '//a[@class="company"]/text()')
  companyUrls = getCompanies(filename, '//a[@class="company"]/@href')
  for i, j in zip(companyNames, companyUrls):
    peopleLinks = getPeople(i, j, filename)
    internal = list(peopleLinks)
    peopleList.append(internal)
  output = csv.writer(open("test.csv", "wb"))
  for row in itertools.izip_longest(*peopleList):
    output.writerow(row)
  return peopleList

Here's a sample of the output:

[[['3M', 'USA', 'Rod Thomas', 'http://site.com/ron-thomas'], ['HP', 'USA', 'Todd Gack', 'http://site.com/todd-gack'], ['Dell', 'USA', 'Phil Watters', 'http://site.com/philwatt-1'], ['IBM', 'USA', 'Mary Sweeney', 'http://site.com/ms2105']], [['3M', 'USA', 'Tom Hill', 'http://site.com/tomhill'], None, ['Dell', 'USA', 'Howard Duck', 'http://site.com/howard-duck'], None], [['3M', 'USA', 'Neil Rallis', 'http://site.com/nrallis-4'], None, None, None]]

This makes for an ugly CSV file that's difficult to read. Can anyone point me in the right direction?

EDIT: This is what I'd like the output to look like.

[['3M', 'USA', 'Rod Thomas', 'http://site.com/ron-thomas'], ['HP', 'USA', 'Todd Gack', 'http://site.com/todd-gack'], ['Dell', 'USA', 'Phil Watters', 'http://site.com/philwatt-1'], ['IBM', 'USA', 'Mary Sweeney', 'http://site.com/ms2105'], ['3M', 'USA', 'Tom Hill', 'http://site.com/tomhill'], ['Dell', 'USA', 'Howard Duck', 'http://site.com/howard-duck'], ['3M', 'USA', 'Neil Rallis', 'http://site.com/nrallis-4']]
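For reference, that flattened shape can also be produced directly from the nested sample by chaining the sub-lists together and dropping the None padding that izip_longest introduces. A minimal sketch on a cut-down version of the data above (itertools.chain.from_iterable flattens exactly one level of nesting):

```python
import itertools

# A trimmed-down version of the nested output shown above.
nested = [
    [['3M', 'USA', 'Rod Thomas', 'http://site.com/ron-thomas'],
     ['HP', 'USA', 'Todd Gack', 'http://site.com/todd-gack']],
    [['3M', 'USA', 'Tom Hill', 'http://site.com/tomhill'], None],
]

# Flatten one level of nesting and filter out the None padding.
flat = [row for row in itertools.chain.from_iterable(nested) if row is not None]
print(flat)
# [['3M', 'USA', 'Rod Thomas', 'http://site.com/ron-thomas'],
#  ['HP', 'USA', 'Todd Gack', 'http://site.com/todd-gack'],
#  ['3M', 'USA', 'Tom Hill', 'http://site.com/tomhill']]
```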
  • Can you please post what your desired format is? Commented Nov 13, 2012 at 21:39
  • Good question. I just updated the question to include the output that I'd like to see. Commented Nov 13, 2012 at 21:51
  • In general, for this kind of task, you want to separate the code that navigates the website from your logic. Write functions, name them appropriately, and then call them from a function that controls your logic. Commented Nov 13, 2012 at 21:52
  • @kreativitea That's good feedback, I appreciate you saying so. I'm self-taught and don't get much feedback on my code/style. I'll think through how I could improve. In this example the lists make it more challenging for me. It would have been easier to do in R (for me, anyway), but I'm forcing myself to get more comfortable with Python. Commented Nov 14, 2012 at 0:05

1 Answer


In this line:

peopleList.append(internal)

you are appending one list into another, which makes the entire internal list a single element of peopleList. That nesting is where the extra brackets (and the None values from izip_longest) come from.

Instead, you want to extend peopleList; that's how you combine two lists element by element.

So it would be:

peopleList.extend(internal)
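To illustrate the difference between the two methods, a minimal example:

```python
a = [1, 2]
a.append([3, 4])   # the whole list becomes one nested element
print(a)           # [1, 2, [3, 4]]

b = [1, 2]
b.extend([3, 4])   # each element is added individually
print(b)           # [1, 2, 3, 4]
```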

1 Comment

This is so simple I'm smacking my head. This does exactly what I wanted. I was wondering why my output had None in it, and how I was going to address that. This simple change took care of that issue as well. Thanks for the tip!
