I have a huge XML file (800 MB), the DBLP dataset, and when I run my code on it I get the error shown below.

My code performs the following steps:

1. Parse the input file with the lxml module.
2. Get a title from the user with raw_input().
3. Select the article tags whose title starts with the user input from step 2.
4. Iterate over every article tag found in step 3.
5. Build a list of lists of tuples that stores each article's child tags and their text in the result.
6. Print the result.

My code:

import lxml.etree as ET
root = ET.parse('input.xml')

title = raw_input('enter the name: ')
articles = root.xpath('.//article[starts-with(title, "%s")]' % title)
result = []
for article in articles:
    tmp = []
    for i in article.getchildren():
        tmp.append((i.tag, i.text))

    result.append(tmp)

#- Print result:
for i in result:
    print "\n"
    for j in i:
        print "%s:%s"%(j[0], j[1])

Error obtained:

Traceback (most recent call last):
  File "C:/Python27/xmp2.py", line 2, in <module>
    root = ET.parse('myxml.xml')
  File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105213)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100163)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94286)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95722)
  File "parser.pxi", line 620, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94789)
XMLSyntaxError: Entity 'ouml' not defined, line 47, column 25

My XML looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>

Please help me to solve my problem! Thanks in advance.

2 Answers

The error message

Entity 'ouml' not defined, line 47, column 25

means that at this position in your document there is an entity reference &ouml; for which no definition has been found. (It's probably intended to represent o-with-umlaut, but unlike in HTML, such entity names are not built into XML; they have to be defined in the DTD.)
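
To make the rule concrete, here is a minimal sketch. It assumes Python's standard-library parser (xml.etree.ElementTree) rather than lxml, just to stay dependency-free, and the two tiny documents and the helper author_text are made up for illustration. The same reference parses when the entity is declared in the document's own DTD subset and fails when it is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-record documents, made up for this illustration.
DECLARED = (
    '<!DOCTYPE dblp [<!ENTITY ouml "&#246;">]>'
    '<dblp><author>J&ouml;rg</author></dblp>'
)
UNDECLARED = '<dblp><author>J&ouml;rg</author></dblp>'


def author_text(xml_string):
    """Return the text of the first <author>, or None if parsing fails."""
    try:
        return ET.fromstring(xml_string).find('author').text
    except ET.ParseError:
        # Raised here for the reference to an entity that was never declared.
        return None


with_declaration = author_text(DECLARED)        # entity declared in the internal subset
without_declaration = author_text(UNDECLARED)   # same reference, no declaration
```

In the DBLP file the declarations live in the external file dblp.dtd instead of inside the document, so the parser has to be able to find and read that file.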

It's possible of course that this is not the only occurrence of such an entity reference in your large source document.

Your document contains a reference to the DTD dblp.dtd. There are two possibilities: either the entity isn't defined in the DTD, or for some reason your parser isn't picking the DTD up. DBLP is a well-known and widely used dataset (created by people who are technically very competent), so I think the first explanation is unlikely, unless some incorrect preprocessing has corrupted the data. The second possibility seems more likely. At this point I can't help any further, because I don't know anything about the Python parser you are using or about its configuration settings.

1 Comment

Just to add some information: for historical reasons, dblp.xml uses a local SYSTEM declaration of the dblp.dtd file. Hence, one needs to download the DTD file along with the XML file. Depending on the parser used, dblp.dtd needs to be copied either to the directory containing dblp.xml or to the current working directory of the script (or both, just to be sure). Alternatively, one can edit the DOCTYPE declaration of the downloaded dblp.xml to use the publicly available DTD file at dblp.uni-trier.de/xml/dblp.dtd.
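
For the lxml parser used in the question, the following is a rough sketch of that setup, not a tested fix for the real 800 MB file. It assumes lxml's XMLParser options load_dtd and resolve_entities, and it assumes the relative SYSTEM id is looked up next to the XML file. The helper parse_with_local_dtd and the two stand-in files are hypothetical and only mimic the DBLP layout.

```python
import os
import shutil
import tempfile

import lxml.etree as ET


def parse_with_local_dtd(xml_path):
    """Parse xml_path, loading the DTD named in its DOCTYPE declaration."""
    # load_dtd=True makes lxml read the DTD, so entities declared there
    # are known; resolve_entities=True replaces references such as &ouml;
    # with their text. Only use this on files you trust.
    parser = ET.XMLParser(load_dtd=True, resolve_entities=True)
    return ET.parse(xml_path, parser)


# Stand-in files, made up for this illustration (not the real DBLP data).
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, 'dblp.dtd'), 'wb') as f:
    f.write(b'<!ENTITY ouml "&#246;">\n')
xml_path = os.path.join(workdir, 'dblp.xml')
with open(xml_path, 'wb') as f:
    f.write(b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
            b'<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
            b'<dblp><article><author>J&ouml;rg</author></article></dblp>\n')

tree = parse_with_local_dtd(xml_path)
author = tree.findtext('.//author')  # text with the entity already replaced

shutil.rmtree(workdir)  # remove the temporary stand-in files
```

With the real dataset, the equivalent would be to place the downloaded dblp.dtd beside dblp.xml and pass such a parser as the second argument to ET.parse in the question's code.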

The problem probably comes from the referenced "dblp.dtd".

"The purpose of a DTD is to define the structure of an XML document." A DTD can also define entities, with a declaration like <!ENTITY ouml...>

Check this to solve your problem.
