
I am scraping a page, pulling data from a table, with the desired end product being a list of lists.

import urllib2
from bs4 import BeautifulSoup

html = BeautifulSoup(urllib2.urlopen('http://domain.com').read(), 'lxml')
tagged_data = [row('td') for row in html('table',{'id' : 'targeted_table'})[0]('tr') ]

# One of the <td>s has an <a> tag in it that I need to grab the link from, hence the conditional
clean_data = [[(item.string if item.string is not None
                else [item('a')[0].string, item('a')[0]['href']])
               for item in info]
              for info in tagged_data]

The above code generates the following structure:

[[[u'data 01', 'http://domain1.com'],
  u'data 02',
  u'data 03',
  u'data 04'],
 [[u'data 11', 'http://domain2.com'],
  u'data 12',
  u'data 13',
  u'data 14'],
 [[u'data 01', 'http://domain1.com'],
  u'data 22',
  u'data 23',
  u'data 24']]

But what I'd really like is:

[[u'data 01',
u'http://domain1.com',
u'data 02',
u'data 03',
u'data 04'],
[u'data 11',
u'http://domain2.com',
u'data 12',
u'data 13',
u'data 14'],
[u'data 01',
u'http://domain1.com',
u'data 22',
u'data 23',
u'data 24']]

I also tried:

clean_data = [[(item.string if item.string is not None
                else (item('a')[0].string, item('a')[0]['href']))
               for item in info]
              for info in tagged_data]

But that puts a tuple (I think) as the first item of each sublist:

[(u'data 01', 'http://domain1.com'),
 u'data 02',
 u'data 03',
 u'data 04']

So, any suggestions?

Example Data

<table id='targeted_table'>
    <tr>
        <td><a href="http://domain.com">data 01</a></td>
        <td>data 02</td>
        <td>data 03</td>
        <td>data 04</td>
    </tr>
    <tr>
        <td><a href="http://domain.com">data 11</a></td>
        <td>data 12</td>
        <td>data 13</td>
        <td>data 14</td>
    </tr>
    <tr>
        <td><a href="http://domain.com">data 01</a></td>
        <td>data 22</td>
        <td>data 23</td>
        <td>data 24</td>
    </tr>
    <tr>
        <td><a href="http://domain.com">data 01</a></td>
        <td>data 32</td>
        <td>data 33</td>
        <td>data 34</td>
    </tr>
</table>
  • @voithos answered my original hypothetical question.
  • @unutbu provides a better overall solution to my problem.

2 Answers


The line

html = BeautifulSoup(urllib2.urlopen('http://domain.com').read(), 'lxml')

implies you have lxml installed, so you could use an XPath expression with the union operator | to pull out both text and attribute values:

import urllib2
import lxml.html as LH

html = LH.parse(urllib2.urlopen('http://domain.com'))

clean_data = [tr.xpath('td/a/text() | td/a/@href | td/text()')
              for tr in html.xpath('//table[@id="targeted_table"]/tr')]
print(clean_data)

yields

[['http://domain.com', 'data 01', 'data 02', 'data 03', 'data 04'], 
 ['http://domain.com', 'data 11', 'data 12', 'data 13', 'data 14'], 
 ['http://domain.com', 'data 01', 'data 22', 'data 23', 'data 24'],
 ['http://domain.com', 'data 01', 'data 32', 'data 33', 'data 34']]
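
If you want to keep the order from the question (cell text first, then the href), a rough sketch of a per-cell variant of the same idea would be:

clean_data = []
for tr in html.xpath('//table[@id="targeted_table"]/tr'):
    row = []
    for td in tr.xpath('td'):
        row.extend(td.xpath('a/text() | text()'))  # the cell's text, linked or plain
        row.extend(td.xpath('a/@href'))            # the href, only present if the cell has a link
    clean_data.append(row)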

You could also do it with a single call to the xpath method:

pieces = iter(html.xpath('''//table[@id="targeted_table"]/tr/td/a/text()
                            | //table[@id="targeted_table"]/tr/td/a/@href
                            | //table[@id="targeted_table"]/tr/td/text()'''))
clean_data = zip(*[pieces]*5)
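
The zip(*[pieces]*5) at the end is the usual grouper idiom: because the same iterator appears five times, each tuple that zip builds consumes five consecutive values, which is one row's worth here (the href plus four text nodes). A minimal sketch with made-up values (Python 2, where zip returns a list):

flat = iter(['a1', 'href1', 'a2', 'a3', 'a4', 'b1', 'href2', 'b2', 'b3', 'b4'])
rows = zip(*[flat] * 5)
# rows == [('a1', 'href1', 'a2', 'a3', 'a4'), ('b1', 'href2', 'b2', 'b3', 'b4')]

Note that this gives tuples rather than lists; wrap it in map(list, ...) if you need lists.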

2 Comments

What about the link? Since I need the href as well.
Sorry, I missed that you wanted those. I've edited the XPath to extract the links as well. The order of the elements is different than what you posted, but perhaps that's okay, maybe even preferable?

You're trying to have the list comprehension emit two elements some of the time, and a single element at other times.

You can do something like this by nesting a second loop in the comprehension that iterates over the one- or two-element result of your "one if [criteria] else two" expression:

clean_data = [[res for item in info for res in (
                  [item.string] if item.string is not None else
                  ([item('a')[0].string, item('a')[0]['href']])
              )]
              for info in tagged_data]
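
For illustration, the same flattening pattern on made-up values:

pairs = [('a', None), ('b', 'http://example.com')]
# The extra "for res in (...)" clause iterates over the one- or
# two-element list, so the result stays flat.
flat = [res for text, href in pairs
        for res in ([text] if href is None else [text, href])]
# flat == ['a', 'b', 'http://example.com']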

Granted, I don't think this method is very clean. If you're parsing HTML/XML, I'd recommend using the right tools for the job and avoiding messy tree traversal.

2 Comments

Well, it sort of works: it is a single list, but there are 4 duplicates per td string. What other tools would you recommend?
@miah: Whoops, typo. Try now.
