get element and change element text with python and lxml

Question

First things first: I know there are already many questions about Python and lxml on Stack Overflow, and I have read most of them, if not all. Here I am looking for a more comprehensive answer.

I am doing some HTML conversion and I need to programmatically parse the HTML and then make some content changes to href, img and such.

This is a simplified version of what I have right now:

from lxml.html import fromstring  # cssselect() below also needs the cssselect package in recent lxml versions

with open(fileName, "r") as inFile:
    inputS = inFile.read()

myTree = fromstring(inputS)  # parse the HTML content into an element tree

breadCrumb = myTree.get_element_by_id("breadcrumb")  # a single element (raises KeyError if the id is missing)
breadCrumbContent = breadCrumb.text_content().strip()  # text content of the breadcrumb

h1 = myTree.xpath('//h1')  # another way: a list of elements matching the XPath
h1Content = h1[0].text_content().strip()  # text content of the first h1

getTail = myTree.cssselect('table.results > tr > td > a + span + br')  # list of elements matching the CSS selector

So that's basically what I know at the moment. Are there any other ways to get elements or attributes using lxml? I know these may not be the best ways to do it, but bear with me, I am new to this whole thing.

Following is what I want to do. I have:

<img src="images/macmail10.gif" alt="" width="555" height="485" /><br />
<a href="http://www.some_url.com/faq/general_faq.html" target="_blank">General FAQs page</a>

They can be nested inside other elements such as div or p. What I want to do is programmatically look for those elements. For an image, I want to extract the src, manipulate it and set src to something else (for example, turn src="images/something.jpg" into src="something_images.jpg"). The same goes for href: I want to change it so that it points somewhere else.

Other than that, I also want to remove some elements from the tree to simplify it, for example:

<head>
    <title>something goes here</title>
</head>
<div>
    <p id="some_p"> Some content </p>
</div>

I want to remove the head node and the div. I can already get the p with id="some_p"; is there a way to grab its parent element? Is there also a way to remove those elements? (In this case: look for head, remove head, then look for id="some_p", get its parent and delete it.)

Thank you!

==================================================

UPDATE: I have already found the solution to this and finished coding it using lxml.etree. I will post the answer as soon as Stack Overflow allows me to. I truly hope that the answer to this question will help other people who have to deal with HTML parsing!

+1 for a clear question. (and not trying to use a regex!)

– the wolf, Commented Sep 16, 2011 at 19:49

the wolf · Accepted Answer · 2011-09-16 20:11:45Z

lxml and ElementTree are quite similar. The ElementTree portion of the lxml documentation site, in fact, just points to ElementTree's documentation.
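As a rough sketch of how the ElementTree-style API covers the attribute edits asked about in the question: `iter()` walks the matching tags, `get()` reads an attribute and `set()` overwrites it. The helper `flatten_image_path`, the wrapping `div` and the `www.example.org` target are made up for illustration; only the `images/something.jpg` to `something_images.jpg` rule comes from the question.

```python
from lxml import html

# Markup from the question, wrapped in a div (the wrapper is an assumption).
SAMPLE = """
<div>
  <img src="images/macmail10.gif" alt="" width="555" height="485" /><br />
  <a href="http://www.some_url.com/faq/general_faq.html" target="_blank">General FAQs page</a>
</div>
"""


def flatten_image_path(src):
    """Hypothetical rule: 'images/name.ext' becomes 'name_images.ext'."""
    folder, _, filename = src.rpartition("/")
    stem, dot, ext = filename.rpartition(".")
    if not folder or not dot:
        return src  # no folder or no extension: leave the value alone
    return f"{stem}_{folder}{dot}{ext}"


root = html.fromstring(SAMPLE)

# iter() visits every <img>, however deeply it is nested.
for img in root.iter("img"):
    src = img.get("src")
    if src:
        img.set("src", flatten_image_path(src))

# Same pattern for links; the replacement host is an illustrative assumption.
for link in root.iter("a"):
    href = link.get("href")
    if href:
        link.set("href", href.replace("http://www.some_url.com", "http://www.example.org"))

# Serialise the modified tree back to a string.
result = html.tostring(root, encoding="unicode")
```

The same `get()`/`set()` calls work on any attribute, so `alt`, `target` and the rest can be handled the same way.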

You might try working through the ElementTree tutorials and examples at the bottom of the overview page. Since ElementTree is part of the Python distribution, it tends to be widely documented (and easily Googled). Once you grok that, extend it with some of the lxml magic not initially found in ElementTree if you need to. For example, lxml maintains parent relationships for every element and ElementTree does not. You can add parent relationships to ElementTree, but it is not an easy example to start with.

That's how I learned it.
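To make the parent-relationship point concrete, here is a minimal sketch of the removal described in the question, using lxml's `getparent()`. The `drop` helper, the `keep_me` paragraph and the surrounding `html`/`body` tags are assumptions added so the snippet runs on its own.

```python
from lxml import html

# The question's snippet, placed in a full document (wrapper tags are assumed).
PAGE = """<html>
<head>
    <title>something goes here</title>
</head>
<body>
<div>
    <p id="some_p"> Some content </p>
</div>
<p id="keep_me">Other content</p>
</body>
</html>"""


def drop(element):
    """Hypothetical helper: detach an element and its subtree from its parent."""
    parent = element.getparent()  # lxml-specific; stdlib ElementTree has no equivalent
    if parent is not None:
        parent.remove(element)


root = html.fromstring(PAGE)

# Look for head and remove it.
head = root.find("head")
if head is not None:
    drop(head)

# Look up the paragraph by id, step up to its parent div, and remove that.
some_p = root.get_element_by_id("some_p")
wrapper = some_p.getparent()
drop(wrapper)
```

One thing to watch: `remove()` also discards the tail text that follows the removed element. Elements parsed with `lxml.html` additionally offer `drop_tree()`, which keeps that text.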

edited Sep 16, 2011 at 20:11

answered Sep 16, 2011 at 20:02


1 Comment

the wolf Over a year ago

@Tanner Hoang: You can use etree. I was suggesting that you use the tutorials and examples from ElementTree on their site, since it is fully documented. You can code and test in etree from lxml but use the ElementTree materials as a reference for the etree part of lxml. The code is pretty much the same. This was my point.
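A small illustration of "pretty much the same", as a sketch only: the function below uses just `fromstring`, `iter`, `get` and `set`, so it can be handed either the standard library module or `lxml.etree`. The function name `repoint_links`, the sample XML and the `old/` and `new/` prefixes are invented for the example. Note that `etree.fromstring` expects well-formed XML; real-world HTML is better parsed with `lxml.html`.

```python
import xml.etree.ElementTree as ET

try:
    from lxml import etree as LET  # optional: only used if lxml is installed
except ImportError:
    LET = None

# Well-formed sample input (invented for this example).
XML = '<body><p><a href="old/a.html">A</a></p><a href="old/b.html">B</a></body>'


def repoint_links(etree_module, xml_text, old_prefix, new_prefix):
    """Rewrite href prefixes using only calls shared by both libraries."""
    root = etree_module.fromstring(xml_text)
    for link in root.iter("a"):
        href = link.get("href", "")
        if href.startswith(old_prefix):
            link.set("href", new_prefix + href[len(old_prefix):])
    # Return the resulting hrefs in document order.
    return [link.get("href") for link in root.iter("a")]


stdlib_result = repoint_links(ET, XML, "old/", "new/")
# Fall back to the stdlib result when lxml is not available.
lxml_result = repoint_links(LET, XML, "old/", "new/") if LET is not None else stdlib_result
```

Both calls should produce the same list, which is why the ElementTree tutorials carry over to the etree part of lxml.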
