First things first: I know there are already many questions about Python and lxml on StackOverflow, and I have read most of them, if not all. What I am looking for in this question is a more comprehensive answer.
I am doing some HTML conversion, and I need to parse the HTML programmatically and then change some of its content, such as the href of links and the src of images.
This is a simplified version of what I have right now:
from lxml.html import fromstring

with open(fileName, "r") as inFile:
    inputS = inFile.read()
myTree = fromstring(inputS)  # parse an element tree from the HTML content
breadCrumb = myTree.get_element_by_id("breadcrumb")  # the single element with that id (raises KeyError if missing)
breadCrumbContent = breadCrumb.text_content().strip()  # text content of the breadcrumb
h1 = myTree.xpath('//h1')  # another way: a list of elements matching an XPath expression
h1Content = h1[0].text_content().strip()  # text content of the first h1
getTail = myTree.cssselect('table.results > tr > td > a + span + br')  # a list of elements matching a CSS selector
So basically that is what I know at the moment. Are there any other ways to get elements or attributes using lxml? I know these may not be the best ways to do it, but bear with me; I am new to this whole thing.
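For reference, here is a minimal sketch of a few other lookups that lxml.html offers besides the ones above: find/findall, iter, get/attrib and iterlinks. The sample markup and variable names are made up for illustration only.

```python
from lxml import html

# Made-up sample markup, not taken from the real pages.
SAMPLE = '<div id="top"><h1>Title</h1><p class="x">One <a href="a.html">link</a></p><p>Two</p></div>'
root = html.fromstring(SAMPLE)  # a single top-level element is returned as-is

first_p = root.find(".//p")            # first matching element, or None
all_p = root.findall(".//p")           # list of all matching elements
tags = [el.tag for el in root.iter()]  # every element in document order, root included

link = root.find(".//a")
href = link.get("href")                # attribute value, or None if it is absent
attrs = dict(first_p.attrib)           # all attributes of an element as a dict

# iterlinks() yields (element, attribute, url, position) for every link-like attribute
links = [(el.tag, attr, url) for el, attr, url, pos in root.iterlinks()]
```

find and findall only understand a limited subset of XPath, so xpath() is still the tool for anything complicated.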
Following is what I want to do. I have:
<img src="images/macmail10.gif" alt="" width="555" height="485" /><br />
<a href="http://www.some_url.com/faq/general_faq.html" target="_blank">General FAQs page</a>
They can be nested inside other elements such as div or p. I want to find those elements programmatically. For an image, I want to extract the src, manipulate it, and set src to something else (for example, turn src="images/something.jpg" into src="something_images.jpg"). The same goes for href: I want to change it so that it points somewhere else.
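A minimal sketch of how this rewriting might look with lxml.html, assuming the two tags above sit inside a div and a p. The helpers rewrite_src and rewrite_href are hypothetical, and the replacement host www.example.com is a placeholder, not the real target.

```python
import posixpath
from urllib.parse import urlsplit

from lxml import html

SAMPLE = '''<div><p>
<img src="images/macmail10.gif" alt="" width="555" height="485" /><br />
<a href="http://www.some_url.com/faq/general_faq.html" target="_blank">General FAQs page</a>
</p></div>'''


def rewrite_src(src):
    """Hypothetical rule: 'images/something.jpg' -> 'something_images.jpg'."""
    folder, name = posixpath.split(src)
    if not folder:
        return src  # no folder part, leave the value alone
    stem, ext = posixpath.splitext(name)
    return f"{stem}_{posixpath.basename(folder)}{ext}"


def rewrite_href(href):
    """Hypothetical rule: keep the path but point at a placeholder host."""
    return urlsplit(href)._replace(netloc="www.example.com").geturl()


tree = html.fromstring(SAMPLE)

# iter() walks all descendants, so the nesting depth does not matter.
for img in tree.iter("img"):
    src = img.get("src")
    if src is not None:
        img.set("src", rewrite_src(src))

for a in tree.iter("a"):
    href = a.get("href")
    if href is not None:
        a.set("href", rewrite_href(href))

result = html.tostring(tree, encoding="unicode")
```

lxml.html elements also have a rewrite_links() method that applies one function to every link in the tree, which may be simpler if img and a should follow the same rule.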
Other than that, I also want to remove some elements from the tree to simplify it, for example:
<head>
<title>something goes here</title>
</head>
<div>
<p id="some_p"> Some content </p>
</div>
I want to remove the head node and the div. I am able to get the p with id="some_p"; is there any way to grab its parent element? Is there also a way to remove those elements? (In this case: look for head, remove head, then look for id="some_p", get its parent, and delete it.)
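A minimal sketch of those steps, assuming lxml.html and a made-up document built around the snippet above. The helper name drop is hypothetical; getparent() and remove() are the lxml calls doing the work.

```python
from lxml import html

# Made-up document wrapping the snippet above; the second p is there to show what survives.
SAMPLE = '''<html><head><title>something goes here</title></head>
<body><div><p id="some_p"> Some content </p></div><p>keep me</p></body></html>'''

doc = html.fromstring(SAMPLE)  # a full document, so the root element is <html>


def drop(element):
    """Remove an element and its whole subtree from its parent, if it has one."""
    parent = element.getparent()
    if parent is not None:
        # Note: remove() also discards the element's tail text.
        parent.remove(element)


head = doc.find("head")  # <head> is a direct child of <html>
if head is not None:
    drop(head)

p = doc.get_element_by_id("some_p", None)  # None instead of KeyError if missing
if p is not None:
    drop(p.getparent())  # removes the enclosing <div> together with the <p>
```

If the text that follows a removed element has to be kept, lxml.html elements also offer drop_tree(), which joins the tail text onto the previous sibling or the parent.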
Thank you!
==================================================
UPDATE: I have already found a solution and finished coding it using lxml.etree. I will post it as an answer as soon as StackOverflow allows me to. I truly hope the answer to this question will help other people who have to deal with HTML parsing!