First things first: I know there are already many questions about Python and lxml on StackOverflow, and I have read most of them, if not all. What I am looking for here is a more comprehensive answer.

I am doing some HTML conversion and I need to programmatically parse the HTML and then make some content changes to href, img and so on.

This is a simplified version of what I have right now:

from lxml.html import fromstring

with open(fileName, "r") as inFile:
    inputS = inFile.read()

myTree = fromstring(inputS) #parse etree from HTML content

breadCrumb = myTree.get_element_by_id("breadcrumb") #the single element with this id (KeyError if absent)
breadCrumbContent = breadCrumb.text_content().strip() #text content of bread crumb

h1 = myTree.xpath('//h1') #another way, get elements by xpath
h1Content = h1[0].text_content().strip() #get text content

getTail = myTree.cssselect('table.results > tr > td > a + span + br') #get list of elements using css select

So basically that's what I know at the moment. Are there any other ways to get elements/attributes using lxml? I know these may not be the best ways to do it, but bear with me, I am new to this whole thing.

Following is what I want to do. I have:

<img src="images/macmail10.gif" alt="" width="555" height="485" /><br />
<a href="http://www.some_url.com/faq/general_faq.html" target="_blank">General FAQs page</a>

They can be nested inside other elements such as div or p. What I want to do is programmatically look for those elements. For an image, I want to extract the src, do some manipulation with it and set src to something else (for example, src="images/something.jpg" into src="something_images.jpg"). The same goes for href: I want to change it to point somewhere else.
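A minimal sketch of that kind of rewrite with lxml.html, under some assumptions: the helper names flatten_src and rewrite, the "faq/" target prefix, and the exact renaming rule are all hypothetical and only illustrate the pattern of reading an attribute with get() and writing it back with set().

```python
import posixpath
from lxml import html

SAMPLE = '''<div>
<p><img src="images/macmail10.gif" alt="" width="555" height="485" /><br />
<a href="http://www.some_url.com/faq/general_faq.html" target="_blank">General FAQs page</a></p>
</div>'''

def flatten_src(src):
    # Hypothetical rule: "images/something.jpg" -> "something_images.jpg"
    directory, name = posixpath.split(src)
    stem, ext = posixpath.splitext(name)
    return f"{stem}_{directory}{ext}" if directory else src

def rewrite(tree, new_base):
    # iter() walks the whole subtree, so nesting depth does not matter
    for img in tree.iter("img"):
        src = img.get("src")
        if src:
            img.set("src", flatten_src(src))
    for a in tree.iter("a"):
        href = a.get("href")
        if href:
            # Hypothetical rule: keep only the file name, under a new prefix
            a.set("href", new_base + posixpath.basename(href))
    return tree

tree = rewrite(html.fromstring(SAMPLE), "faq/")
new_src = tree.find(".//img").get("src")
new_href = tree.find(".//a").get("href")
```

html.tostring(tree) would serialize the modified tree back to markup once the attributes have been changed.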

Other than that, I also want to remove some elements from the tree to simplify it, for example:

<head>
    <title>something goes here</title>
</head>
<div>
    <p id="some_p"> Some content </p>
</div>

I would want to remove the head node and the div. I am able to get the p with id="some_p"; is there any way to grab its parent element? Is there also a way to remove those elements? (In this case: look for head, remove head, then look for id="some_p", get the parent and delete it.)
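One possible way to do this with lxml.html is sketched below. It relies on getparent() and remove(), which lxml elements provide; the sample document and the extra p with id="keep" are assumptions added only to show that unrelated content survives.

```python
from lxml import html

SAMPLE = '''<html>
<head><title>something goes here</title></head>
<body>
<div><p id="some_p"> Some content </p></div>
<p id="keep">Kept</p>
</body>
</html>'''

tree = html.fromstring(SAMPLE)

# Remove <head>: an element is removed through its parent.
head = tree.find("head")
if head is not None:
    head.getparent().remove(head)

# Find the <p> by id, step up to its parent <div>, and remove that.
some_p = tree.get_element_by_id("some_p")
div = some_p.getparent()
# drop_tree() is specific to lxml.html elements; unlike remove(), it keeps
# any text that follows the removed element (its tail).
div.drop_tree()
```

Note that parent.remove(child) also discards the text that follows the child, so drop_tree() is usually the safer choice when the element sits in the middle of running text.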

Thank you!

==================================================

UPDATE: I already found the solution to this and already finished coding using lxml.etree. I will post the answer to that as soon as stackoverflow allows me. I truly hope that the answer for this question would be of help to other people when they have to deal with HTML parsing!

1
  • +1 for a clear question. (and not trying to use a regex!) Commented Sep 16, 2011 at 19:49

1 Answer

lxml and ElementTree are quite similar. The ElementTree portion of the lxml documentation site, in fact, just points to ElementTree's documentation.

You might try working through the ElementTree tutorials and examples at the bottom of the overview page. Since ElementTree is part of the Python distribution, it tends to be widely documented (and easily Googled). Once you grok that, extend it with some of the lxml magic not found in ElementTree if you need to. For example, lxml maintains parent relationships for every element and ElementTree does not. You can add parent relationships to ElementTree, but it is not an easy example to start with.
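To illustrate the parent-relationship point, here is a small sketch using only the standard library's ElementTree. The sample markup and the name parent_of are made up for the example; the idea is to build a child-to-parent map once and use it wherever lxml would offer getparent().

```python
import xml.etree.ElementTree as ET

SAMPLE = '<root><div><p id="some_p"> Some content </p></div></root>'

root = ET.fromstring(SAMPLE)

# ElementTree elements do not know their parent, so build the mapping by hand.
parent_of = {child: parent for parent in root.iter() for child in parent}

some_p = root.find(".//p[@id='some_p']")
div = parent_of[some_p]        # what some_p.getparent() would return in lxml
parent_of[div].remove(div)     # remove the <div> through its own parent
```

The map is a snapshot: it has to be rebuilt if the tree is restructured afterwards, which is part of why lxml's built-in getparent() is more convenient.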

That's how I learned it.

1 Comment

@Tanner Hoang: You can use etree. I was suggesting that you use the tutorials and examples from ElementTree on their site, since they are fully documented. You can code and test in lxml's etree but use the ElementTree materials as a reference for the etree part of lxml. The code is pretty much the same. That was my point.
