
I am trying to extract the information from a link on a page that is structured like this:

...

<td align="left" bgcolor="#FFFFFF">$725,000</td>

<td align="left" bgcolor="#FFFFFF"> Available</td>

*<td align="left" bgcolor="#FFFFFF">
    <a href="/washington">


 Washington Street Studios
<br>1410 Washington Street SW<br>Albany, Oregon, 97321
</a>
</td>*

<td align="center" bgcolor="#FFFFFF">15</td>

<td align="center" bgcolor="#FFFFFF">8.49%</td>

<td align="center" bgcolor="#FFFFFF">$48,333</td>

</tr>

I tried targeting elements with the attribute align="left" and iterating over them, but that didn't work out. If anybody could help me locate elements like <a href="/washington"> (there are multiple tags like this within the same page) with Selenium, I would appreciate it.
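For reference, this is roughly what I tried (driver is my WebDriver instance; the XPath is just my guess at targeting those cells):

for cell in driver.find_elements_by_xpath("//td[@align='left']"):
    print(cell.text)   # this prints the price/status cells too, not just the cells with links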

  • Can you post more tr rows so we can get a clear picture of where the desired links are located? Thanks. Commented Sep 10, 2015 at 12:04

2 Answers


I would use lxml instead, if it is just to process HTML...

It would help if you were more specific, but if you are just traversing the links in a webpage, you can try this:

from lxml.html import parse

pdoc = parse(url_of_webpage)   # parse the page straight from its URL
doc = pdoc.getroot()
list_of_links = [i[2] for i in doc.iterlinks()]

list_of_links will look like ['/en/images/logo_com.gif', 'http://www.brand.com/', '/en/images/logo.gif']

doc.iterlinks() looks for every link in the document (form, img and a tags, among others) and yields a tuple for each one containing the Element object, the attribute the link came from, the URL itself and its position, so the line

list_of_links = [i[2] for i in doc.iterlinks()]

simply grabs the URL from each tuple and returns them as a separate list.
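
If you only care about the a tags (which seems to be your case), you can also filter on the element itself; a rough sketch:

# iterlinks() also yields links from <img>, <form>, <link> etc.,
# so keep only the URLs that come from the href attribute of <a> tags
anchor_urls = [link for el, attr, link, pos in doc.iterlinks()
               if el.tag == 'a' and attr == 'href']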

Note that the retrieved URLs can be relative, i.e. you will see URLs like

'/en/images/logo_com.gif'

instead of

'http://somedomain.com/en/images/logo_com.gif'

If you want the latter kind of URL, add one more line to the code:

from lxml.html import parse

pdoc = parse(url_of_webpage)
doc = pdoc.getroot()
doc.make_links_absolute()     # add this line
list_of_links = [i[2] for i in doc.iterlinks()]

If you are processing the URLs one by one, then simply modify the code to something like

for i in doc.iterlinks():
    url = i[2]
    # some processing here with url...

Finally, if for some reason you need Selenium to come in and fetch the webpage content, then simply replace the parsing step at the beginning with the following:

from selenium import webdriver
from StringIO import StringIO   # Python 2; on Python 3 use: from io import StringIO

browser = webdriver.Firefox()
browser.get(url)
doc = parse(StringIO(browser.page_source)).getroot()
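
One caveat, as far as I can tell: when you parse from a string like this, the document has no base URL, so if you also want absolute links here you have to pass the address in yourself, e.g.

doc.make_links_absolute(browser.current_url)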

From what you have provided so far, there is a table and the desired links are in a specific column. There are no "data-oriented" attributes to rely on, but using the column index to locate the links looks good enough:

for row in driver.find_elements_by_css_selector("table#myid tr"):
    cells = row.find_elements_by_tag_name("td")

    print(cells[2].text)  # put a correct index here
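
If what you actually need is the link itself rather than the cell text, you can (assuming the anchor is the only a tag inside that cell) get it with something like:

link = cells[2].find_element_by_tag_name("a")   # same column index as above
print(link.get_attribute("href"))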

