5

I am using the code below to get all of the HTML content of a section so I can save it to a database:

el = doc.get_element_by_id('productDescription')
lxml.html.tostring(el)

The product description has a tag that looks like this:

<div id='productDescription'>

     <THE HTML CODE I WANT>

</div>

The code works and gives me all of the HTML, but how do I remove the outer layer, i.e. the opening <div id='productDescription'> and the closing </div>?


4 Answers

3

You could convert each child to a string individually:

text = el.text or ''  # el.text is None when there is no leading text
text += ''.join(lxml.html.tostring(child, encoding='unicode') for child in el.iterchildren())

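Or, in an even more hackish way (note that this modifies el in place):

el.attrib.clear()
el.tag = '|||'
# encoding='unicode' returns str instead of bytes; with_tail=False drops any text after the closing tag
text = lxml.html.tostring(el, encoding='unicode', with_tail=False)
start, end = '<' + el.tag + '>', '</' + el.tag + '>'
assert text.startswith(start) and text.endswith(end)
text = text[len(start):-len(end)]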

3

If your productDescription div contains mixed text/element content, e.g.

<div id='productDescription'>
  the
  <b> html code </b>
  i want
</div>

you can get the content (as a string) by iterating over the result of xpath('node()'):

s = ''
for node in el.xpath('node()'):
    if isinstance(node, str):  # text node; use basestring on Python 2
        s += node
    else:
        s += lxml.html.tostring(node, encoding='unicode', with_tail=False)

1 Comment

What is basestring?
0

Here is a function that does what you want.

import lxml.etree


def strip_outer(xml):
    """
    >>> xml = '''<math xmlns="http://www.w3.org/1998/Math/MathML" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1998/Math/MathML         http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd">
    ...   <mrow>
    ...     <msup>
    ...       <mi>x</mi>
    ...       <mn>2</mn>
    ...     </msup>
    ...     <mo> + </mo>
    ...     <mi>x</mi>
    ...   </mrow>
    ... </math>'''
    >>> so = strip_outer(xml)
    >>> so.splitlines()[0]=='<mrow>'
    True

    """
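    # Rename the default namespace declaration so the tag stays plain 'math';
    # with xmlns= in place the tag is namespace-qualified and strip_tags(rx, 'math') does not match it.
    xml = xml.replace('xmlns=', 'xmlns:x=')
    # strip_tags cannot remove the root element, so wrap the input in a temporary one.
    xml = '<root>\n' + xml + '\n</root>'
    rx = lxml.etree.XML(xml)
    lxml.etree.strip_tags(rx, 'math')  # drop <math> and its attributes, keep its content
    uc = lxml.etree.tostring(rx, encoding='unicode')
    uc = u'\n'.join(uc.splitlines()[1:-1])  # remove the temporary <root> lines again
    return uc.strip()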


0

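Use a regular expression.

import re
from lxml.html import tostring

def strip_outer_tag(html_fragment):
    # Capture everything between the first opening tag and the final closing tag
    outer_tag = re.compile(r'^<[^>]+>(.*?)</[^>]+>$', re.DOTALL)
    return outer_tag.search(html_fragment).group(1)

# encoding='unicode' makes tostring return str; on Python 3 the default is bytes, which this pattern cannot search
html_fragment = strip_outer_tag(tostring(el, encoding='unicode'))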
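
To show how that pattern behaves, here is a small standard-library-only sketch run on a hand-written fragment. The sample strings and the ValueError on a failed match are additions for illustration; the original function simply fails with an AttributeError when nothing matches.

```python
import re

# Same pattern as above: first opening tag, lazily captured body, final closing tag
OUTER_TAG = re.compile(r"^<[^>]+>(.*?)</[^>]+>$", re.DOTALL)


def strip_outer_tag(html_fragment):
    """Return the text between the outermost opening and closing tags."""
    match = OUTER_TAG.search(html_fragment)
    if match is None:
        # Happens, for example, when text follows the closing tag
        raise ValueError("fragment is not wrapped in a single outer tag")
    return match.group(1)


# Hypothetical fragment, used only for this example
sample = "<div id='productDescription'>\n  the <b>html code</b> i want\n</div>"
inner = strip_outer_tag(sample)
print(inner)

# Nested tags with the same name still work, because $ anchors the closing tag at the end
nested = strip_outer_tag("<div><div>a</div></div>")
print(nested)
```

The pattern relies on the string starting with the opening tag and ending with the closing tag, so any tail text serialized after the element has to be removed before matching.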
