I am scraping a page and pulling data from a table. The desired end product is a flat list of lists, one inner list per table row.
import urllib2
from bs4 import BeautifulSoup
html = BeautifulSoup(urllib2.urlopen('http://domain.com').read(), 'lxml')
# Collect the <td> cells of every <tr> in the targeted table
tagged_data = [row('td') for row in html('table', {'id': 'targeted_table'})[0]('tr')]
# One of the <td> cells contains an <a> tag whose link I need to grab, hence the conditional
clean_data = [[item.string if item.string is not None
               else [item('a')[0].string, item('a')[0]['href']]
               for item in info]
              for info in tagged_data]
The above code generates the following structure:
[[[u'data 01',
'http://domain1.com'],
u'data 02',
u'data 03',
u'data 04'],
[[u'data 11',
'http://domain2.com'],
u'data 12',
u'data 13',
u'data 14'],
[[u'data 01',
'http://domain1.com'],
u'data 22',
u'data 23',
u'data 24']]
But what I'd really like is:
[[u'data 01',
u'http://domain1.com',
u'data 02',
u'data 03',
u'data 04'],
[u'data 11',
u'http://domain2.com',
u'data 12',
u'data 13',
u'data 14'],
[u'data 01',
u'http://domain1.com',
u'data 22',
u'data 23',
u'data 24']]
I also tried:
clean_data = [[item.string if item.string is not None
               else (item('a')[0].string, item('a')[0]['href'])
               for item in info]
              for info in tagged_data]
But that puts a tuple (I think) in the first position of each sublist:
[(u'data01',
'http://domain1.com'),
u'data02',
u'data03',
u'data04']
Any suggestions on how to get the flat structure?
Example Data
<table id='targeted_table'>
<tr>
<td><a href="http://domain.com">data 01</a></td>
<td>data 02</td>
<td>data 03</td>
<td>data 04</td>
</tr>
<tr>
<td><a href="http://domain.com">data 11</a></td>
<td>data 12</td>
<td>data 13</td>
<td>data 14</td>
</tr>
<tr>
<td><a href="http://domain.com">data 01</a></td>
<td>data 22</td>
<td>data 23</td>
<td>data 24</td>
</tr>
<tr>
<td><a href="http://domain.com">data 01</a></td>
<td>data 32</td>
<td>data 33</td>
<td>data 34</td>
</tr>
</table>
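One possible approach, sketched under the assumption that every cell in a row is either a plain string or a list/tuple of strings (as in the structures shown above), is to flatten each row by one level after it has been built. The helper name `flatten_row` is hypothetical, and the sample rows are copied from the nested output above rather than scraped live.

```python
def flatten_row(row):
    """Flatten a row by one level (hypothetical helper).

    Cells that are lists or tuples are expanded in place;
    every other cell (e.g. a string) is kept as-is.
    """
    flat = []
    for cell in row:
        if isinstance(cell, (list, tuple)):
            flat.extend(cell)   # splice the [text, href] pair into the row
        else:
            flat.append(cell)   # plain string cell
    return flat


# Sample rows copied from the nested structure shown in the question
clean_data = [
    [[u'data 01', 'http://domain1.com'], u'data 02', u'data 03', u'data 04'],
    [[u'data 11', 'http://domain2.com'], u'data 12', u'data 13', u'data 14'],
]

flat_data = [flatten_row(row) for row in clean_data]
```

The same idea could presumably be applied while scraping, by having every cell produce a list (one element for plain cells, two for link cells) and joining them per row with `itertools.chain.from_iterable`.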