
I am scraping a page, pulling data from a table, with the desired end product being a list of lists.

import urllib2
from bs4 import BeautifulSoup

html = BeautifulSoup(urllib2.urlopen('http://domain.com').read(), 'lxml')
tagged_data = [row('td') for row in html('table',{'id' : 'targeted_table'})[0]('tr') ]

# One of the <td>s has an <a> tag in it that I need to grab the link from, hence the conditional
clean_data = [[(item.string if item.string is not None
                else [item('a')[0].string, item('a')[0]['href']])
               for item in info]
              for info in tagged_data]

The above code generates the following structure:

[[[u'data 01', 'http://domain1.com'],
  u'data 02',
  u'data 03',
  u'data 04'],
 [[u'data 11', 'http://domain2.com'],
  u'data 12',
  u'data 13',
  u'data 14'],
 [[u'data 01', 'http://domain1.com'],
  u'data 22',
  u'data 23',
  u'data 24']]

But what I'd really like is:

[[u'data 01',
u'http://domain1.com',
u'data 02',
u'data 03',
u'data 04'],
[u'data 11',
u'http://domain2.com',
u'data 12',
u'data 13',
u'data 14'],
[u'data 01',
u'http://domain1.com',
u'data 22',
u'data 23',
u'data 24']]

I also tried:

clean_data = [[(item.string if item.string is not None
                else (item('a')[0].string, item('a')[0]['href']))
               for item in info]
              for info in tagged_data]

But that puts a tuple (I think) as the first item of each sublist:

[(u'data 01', 'http://domain1.com'),
 u'data 02',
 u'data 03',
 u'data 04']

So, any suggestions?

Example Data

<table id='targeted_table'>
    <tr>
        <td><a href="http://domain.com">data 01</a></td>
        <td>data 02</td>
        <td>data 03</td>
        <td>data 04</td>
    </tr>
    <tr>
        <td><a href="http://domain.com">data 11</a></td>
        <td>data 12</td>
        <td>data 13</td>
        <td>data 14</td>
    </tr>
    <tr>
        <td><a href="http://domain.com">data 01</a></td>
        <td>data 22</td>
        <td>data 23</td>
        <td>data 24</td>
    </tr>
    <tr>
        <td><a href="http://domain.com">data 01</a></td>
        <td>data 32</td>
        <td>data 33</td>
        <td>data 34</td>
    </tr>
</table>
  • @voithos answered my original hypothetical question.
  • @unutbu provides a better overall solution to my problem.

2 Answers


The line

html = BeautifulSoup(urllib2.urlopen('http://domain.com').read(), 'lxml')

implies you have lxml installed, so you could use an XPath expression with the union operator | to pull out both text and attribute values:

import urllib2
import lxml.html as LH

html = LH.parse(urllib2.urlopen('http://domain.com'))

clean_data = [tr.xpath('td/a/text() | td/a/@href | td/text()')
              for tr in html.xpath('//table[@id="targeted_table"]/tr')]
print(clean_data)

yields

[['http://domain.com', 'data 01', 'data 02', 'data 03', 'data 04'], 
 ['http://domain.com', 'data 11', 'data 12', 'data 13', 'data 14'], 
 ['http://domain.com', 'data 01', 'data 22', 'data 23', 'data 24'],
 ['http://domain.com', 'data 01', 'data 32', 'data 33', 'data 34']]
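
If you want to keep the order from the question (cell text first, then the href), a rough sketch of a per-cell variant of the same idea would be:

clean_data = []
for tr in html.xpath('//table[@id="targeted_table"]/tr'):
    row = []
    for td in tr.xpath('td'):
        row.extend(td.xpath('a/text() | text()'))  # the cell's text, linked or plain
        row.extend(td.xpath('a/@href'))            # the href, only present if the cell has a link
    clean_data.append(row)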

You could also do it with a single call to the xpath method:

pieces = iter(html.xpath('''//table[@id="targeted_table"]/tr/td/a/text()
                            | //table[@id="targeted_table"]/tr/td/a/@href
                            | //table[@id="targeted_table"]/tr/td/text()'''))
clean_data = zip(*[pieces]*5)
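
The zip(*[pieces]*5) at the end is the usual grouper idiom: because the same iterator appears five times, each tuple that zip builds consumes five consecutive values, which is one row's worth here (the href plus four text nodes). A minimal sketch with made-up values (Python 2, where zip returns a list):

flat = iter(['a1', 'href1', 'a2', 'a3', 'a4', 'b1', 'href2', 'b2', 'b3', 'b4'])
rows = zip(*[flat] * 5)
# rows == [('a1', 'href1', 'a2', 'a3', 'a4'), ('b1', 'href2', 'b2', 'b3', 'b4')]

Note that this gives tuples rather than lists; wrap it in map(list, ...) if you need lists.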

2 Comments

What about the link? Since I need the href as well.
Sorry, I missed that you wanted those. I've edited the XPath to extract the links as well. The order of the elements is different than what you posted, but perhaps that's okay, maybe even preferable?

You're trying to have the list comprehension emit two elements some of the time, and a single element at other times.

You can do something like this by nesting a second loop in the comprehension that iterates over the one- or two-element result of your "one if [criteria] else two" expression:

clean_data = [[res for item in info for res in (
                  [item.string] if item.string is not None else
                  ([item('a')[0].string, item('a')[0]['href']])
              )]
              for info in tagged_data]
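
For illustration, the same flattening pattern on made-up values:

pairs = [('a', None), ('b', 'http://example.com')]
# The extra "for res in (...)" clause iterates over the one- or
# two-element list, so the result stays flat.
flat = [res for text, href in pairs
        for res in ([text] if href is None else [text, href])]
# flat == ['a', 'b', 'http://example.com']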

Granted, I don't think this method is very clean. If you're parsing HTML/XML, I'd recommend using the right tools for the job and avoiding messy tree traversal.

2 Comments

Well, it sort of works: it is a single list, but there are 4 duplicates per td string. What other tools would you recommend?
@miah: Whoops, typo. Try now.
