Extracting a nested namespace from an XML file using lxml

Question

I'm new to Python and currently learning to parse XML. All was going well until I hit a wall with nested namespaces.

Below is a snippet of my XML, with the root element and the child element that I'm trying to parse:

<?xml version="1.0" encoding="UTF-8"?>
-<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
-------------
-------------
------------- 
-<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO- ASDCP-CC-CPL-20070926#"><Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id><EditRate>24 1</EditRate><IntrinsicDuration>2698</IntrinsicDuration></cc-cpl:MainClosedCaption>
------------
------------
------------
</CompositionPlaylist>

What I need is a way to extract the namespace URI of the element whose local name is 'MainClosedCaption'. In this case, I'm trying to extract the string "http://www.digicine.com/PROTO- ASDCP-CC-CPL-20070926#". I have looked through a lot of tutorials but cannot seem to find a solution.

If anyone can lend their expertise, it would be much appreciated.

Here is what I have done so far, with help from the two contributors:

#!/usr/bin/env python

import os
import sys

from xml.etree import ElementTree as ET  # import ElementTree module under the alias ET
from lxml import objectify, etree


def parse():
    cpl_file = sys.argv[1]
    # Resolve the file name relative to the directory containing this script.
    xml_file = os.path.abspath(__file__)
    xml_file = os.path.dirname(xml_file)
    xml_file = os.path.join(xml_file, cpl_file)

    # Read bytes: lxml rejects unicode strings that carry an encoding declaration.
    with open(xml_file, 'rb') as f:
        xml = f.read()

    tree = etree.XML(xml)

    caption_namespace = etree.QName(tree.find('.//{*}MainClosedCaption')).namespace

    print(caption_namespace)
    print(tree.nsmap)

    nsmap = {}

    # Collect every prefixed namespace declared anywhere in the document.
    for ns in tree.xpath('//namespace::*'):
        if ns[0]:
            nsmap[ns[0]] = ns[1]
    tree.xpath('//cc-cpl:MainClosedCaption', namespace=nsmap)

    return nsmap


if __name__ == "__main__":
    parse()

But it's not working so far. I get 'None' when I use QName to locate the tag and its namespace. And when I try to collect all the namespaces in the XML using a for loop, as suggested in another post, I get the error 'Unknown return type: dict'.

Any suggestions, please?

I'm not following your description. In this example, exactly what string are you trying to extract?
– David, Commented May 8, 2015 at 0:01
I'm trying to extract the namespace associated with the tag 'MainClosedCaption'.
– Daniel Tan, Commented May 8, 2015 at 0:21
In this case, the string that I'm trying to extract from the XML is 'digicine.com/PROTO- ASDCP-CC-CPL-20070926#'.
– Daniel Tan, Commented May 8, 2015 at 0:22
@DanielTan Post some code showing what you have tried so far. It is always easier for people to suggest a solution based on what you have, instead of starting over from scratch. And usually, that kind of solution is easier for the asker to understand too.
– har07, Commented May 8, 2015 at 1:23

Robᵩ · Accepted Answer · 2015-05-08 02:35:20Z

This program prints the namespace of the indicated tag:

from lxml import etree

# A bytes literal is used because lxml refuses unicode strings that
# contain an XML encoding declaration.
xml = etree.XML(b'''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
<Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>2698</IntrinsicDuration>
</cc-cpl:MainClosedCaption>
</CompositionPlaylist>
''')

# Find the element by local name in any namespace, then report its namespace URI.
print(etree.QName(xml.find('.//{*}MainClosedCaption')).namespace)

Result:

http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#

Reference: http://lxml.de/tutorial.html#namespaces
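
The 'Unknown return type: dict' error mentioned in the question appears to come from the keyword used in the xpath() call: lxml expects the prefix mapping under `namespaces` (plural), and treats any other keyword, such as `namespace`, as an XPath variable, which cannot be a dict. The following is a minimal sketch of the prefix-collecting approach with that keyword corrected. The helper name `collect_prefixed_namespaces` is made up for illustration, and the sample document is an abbreviated copy of the question's snippet with the stray whitespace removed, which is an assumption about what the real file looks like.

```python
from lxml import etree

# Abbreviated copy of the question's XML (assumption: the spaces inside the
# namespace URI and the UUID in the original post are copy/paste artifacts).
XML = b'''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
    <Id>urn:uuid:0607e57f-edcc-46ec-997a-d2fbc0c1ea3a</Id>
    <EditRate>24 1</EditRate>
    <IntrinsicDuration>2698</IntrinsicDuration>
  </cc-cpl:MainClosedCaption>
</CompositionPlaylist>
'''


def collect_prefixed_namespaces(root):
    """Hypothetical helper: map every declared prefix in the tree to its URI."""
    nsmap = {}
    # lxml returns namespace nodes as (prefix, uri) tuples.
    for prefix, uri in root.xpath('//namespace::*'):
        # Skip the default namespace (prefix None) and the built-in 'xml' prefix.
        if prefix and prefix != 'xml':
            nsmap[prefix] = uri
    return nsmap


root = etree.XML(XML)
nsmap = collect_prefixed_namespaces(root)

# Note the keyword: 'namespaces', not 'namespace'.
matches = root.xpath('//cc-cpl:MainClosedCaption', namespaces=nsmap)
caption_namespace = etree.QName(matches[0]).namespace
print(caption_namespace)
```

This variant only helps when the prefix used in the file (`cc-cpl` here) is known in advance; the wildcard search shown in the answer does not need it.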


5 Comments

Daniel Tan Over a year ago

I did what you suggested but got 'None' as a result. Please see my original post for my code.

Robᵩ Over a year ago

When I run the code in your question against the XML in your question, I get http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#. (Of course, I have to fix the typos in your XML first.) Perhaps the XML snippet in your question doesn't represent the XML you are actually using?

Daniel Tan Over a year ago

The complete XML is different, with more child elements under the root tag. But I also copied the exact code that you pasted here, and I get 'None' as well.

Robᵩ Over a year ago

I'm sorry, but I have no idea why we would each get different output from the exact same program.
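
One way to narrow down a discrepancy like this is to print the installed lxml version and repeat the search without the `{*}` wildcard. The thread does not establish that a version difference is the cause; that is only an assumption behind this sketch. The function name `find_namespace_by_local_name` is hypothetical, and the sample document is an abbreviated, cleaned-up copy of the question's snippet.

```python
from lxml import etree

# Abbreviated copy of the question's XML (assumption: stray spaces removed).
XML = b'''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <!-- Generated by orca_wrapping version 3.8.3-0 -->
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
    <Id>urn:uuid:0607e57f-edcc-46ec-997a-d2fbc0c1ea3a</Id>
  </cc-cpl:MainClosedCaption>
</CompositionPlaylist>
'''


def find_namespace_by_local_name(root, local_name):
    """Hypothetical helper: namespace URI of the first element with this local name."""
    # tag=etree.Element restricts iteration to real elements, skipping
    # comments and processing instructions.
    for element in root.iter(tag=etree.Element):
        qname = etree.QName(element)
        if qname.localname == local_name:
            return qname.namespace
    return None


root = etree.XML(XML)

# Version tuples are useful when two people compare differing results.
print(etree.LXML_VERSION)
print(etree.LIBXML_VERSION)

caption_namespace = find_namespace_by_local_name(root, 'MainClosedCaption')
print(caption_namespace)
```

If this loop finds the element while the wildcard search does not, the two environments are likely handling the `{*}` syntax differently.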

kieran Over a year ago

By the way, Rob's suggestion worked for me. I'm currently having difficulty extracting the //MainClosedCaption/Id element. stackoverflow.com/questions/37038148/…
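
For the XML shown in this question, the unprefixed `Id` child of `MainClosedCaption` inherits the document's default namespace rather than the `cc-cpl` one, so an XPath such as `//MainClosedCaption/Id` without namespaces matches nothing. Below is a rough sketch of one way to address both elements. The prefixes `cpl` and `cc` are arbitrary names chosen for this example, and the sample document is an abbreviated, cleaned-up copy of the snippet above; the XML in the linked question may be structured differently.

```python
from lxml import etree

# Abbreviated copy of the question's XML (assumption: stray spaces removed).
XML = b'''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
    <Id>urn:uuid:0607e57f-edcc-46ec-997a-d2fbc0c1ea3a</Id>
    <EditRate>24 1</EditRate>
    <IntrinsicDuration>2698</IntrinsicDuration>
  </cc-cpl:MainClosedCaption>
</CompositionPlaylist>
'''

# Prefixes here are local to the XPath query; they need not match the file.
NAMESPACES = {
    'cpl': 'http://www.digicine.com/PROTO-ASDCP-CPL-20040511#',
    'cc': 'http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#',
}

root = etree.XML(XML)

# MainClosedCaption lives in the closed-caption namespace, while its Id child
# falls back to the default (CPL) namespace declared on the root element.
caption_ids = root.xpath('//cc:MainClosedCaption/cpl:Id/text()',
                         namespaces=NAMESPACES)
print(caption_ids)
```

XPath 1.0 has no notion of a default namespace, which is why the default namespace also needs an explicit prefix in the mapping.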
