Full text searching XML data with Python: best practices, pros & cons

Question

Task

I want to use Python for doing full text searches of XML data.

Example data

<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>

Basic functionality

The most basic functionality I want is that a search for "other" in an XPath ("/elements/elem") returns at least the value of the ID attribute for the matching element (elem 2) and nested element (elem 3, nested 1) or the matching XPaths.

Ideal functionality

The solution should be flexible and scalable. I am looking for possible combinations of these features:

search nested elements (infinite depth)
search attributes
search for sentences and paragraphs
search using wildcards
search using fuzzy matching
return precise matching info
good search speed for large XML files

Question

I don't expect a solution with all of the ideal functionality, I'll have to combine different existing functionalities and code the rest myself. But first I would like to know more about what there is out there, which libraries and approaches you would usually use for this, what their pros and cons are.

EDIT: Thanks for the answers so far, I added detail and started a bounty.

Good question, +1. See my answer for the two most powerful and standardized solutions. — Dimitre Novatchev
– Dimitre Novatchev, Commented Apr 28, 2011 at 19:41

Steven · Accepted Answer · 2011-05-05 11:51:58Z

6

Not sure if that will be enough for your needs, but lxml has support for regular expressions in xpath (meaning: you can use xpath 1.0 plus the EXSLT extension functions for regular expressions)

Compared to the feature list that was added later:

search nested elements (infinite depth): yes
search attributes: yes
search for sentences and paragraphs: no. Assuming that "paragraphs" are actual xml elements, then yes. But "sentences" as such, no.
search using wildcards: yes (regular expressions)
search using fuzzy matching: no (assuming stemming, synonyms and so on...)
return precise matching info: yes
good search speed for large XML files: yes, except when your files are so extremely large that you would actually need a fulltext index to get good speed anyway.

The only way to satisfy all your request that I see, would be to load your files into a native xml database that supports "real" fulltext search (via XQuery Fulltext probably) and use that. (can't help you much further with that, maybe Sedna, which seems to have a python API and seems to supports fulltext search?)

edited May 5, 2011 at 11:51

answered Apr 26, 2011 at 20:50

Steven

28.9k6 gold badges64 silver badges51 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

lecodesportif Over a year ago

@Jakob Bowyer: lxml is nice, but why is it the only way to go? How does it compare to options in the other answers?

user357812 Over a year ago

@locodesportif: It doesn't get close to XPath 2.0, for example.

Max · Accepted Answer · 2011-04-26 15:14:58Z

2

I think you would be best served using a full text search engine like Solr: http://lucene.apache.org/solr/

What you can do is store a "document" in solr for each <elem /> in your xml. You can store any data you like in the document. Then you can search against the index and grab the id field stored in the matching documents. This will be very fast for a large set of documents.

answered Apr 26, 2011 at 15:14

Max

7,1397 gold badges49 silver badges62 bronze badges

6 Comments

lecodesportif Over a year ago

What's the overhead of Solr? Is there any Python integration?

Max Over a year ago

Solr is an indexing and search server. As for overhead, it depends on what you mean by that. If you want to search a large amount of these documents, using a full text search engine is really your best option. If you just have a few, it might not be worth it. With Solr you have to run a server, specify the format of the documents, and send them to it to be indexed. There is python integration, but it's also a simple api that responds with JSON so you can use it from anywhere. I highly recommend it!

lecodesportif Over a year ago

Thanks Max, this is something I'll keep in mind. But I need a more simple, easier to deploy, search solution for single, large XML files accessed by a Python script.

user177800 Over a year ago

Solr is an extension of Lucence which is Java. Huge footprint and complexity for something as simple as these requirements.

Max Over a year ago

I don't agree. His ideal requirements seem better suited for a full text search engine than a regex.

|

Community · Accepted Answer · 2017-05-23 12:03:12Z

1

select="/elements/elem//[contains(.,'other')]"

see also xpath: find a node that has a given attribute whose value contains a string

edited May 23, 2017 at 12:03

CommunityBot

11 silver badge

answered Apr 26, 2011 at 13:21

vartec

135k38 gold badges227 silver badges248 bronze badges

2 Comments

lecodesportif Over a year ago

What are the constraints of this? Can I search for text that includes linebreaks and use wildcards?

vartec Over a year ago

contains() is just substring search. If you need 'wildcards', use matches().

Dimitre Novatchev · Accepted Answer · 2011-04-28 19:40:17Z

1

I'd recommend the following two:

Use XPath 2.0. It supports regular expressions.

Or,

Use XQuery and XPath (2.0) Full Text, which has even more powerful features.

answered Apr 28, 2011 at 19:40

Dimitre Novatchev

244k27 gold badges308 silver badges438 bronze badges

Comments

Bryce Siedschlaw · Accepted Answer · 2011-05-04 22:06:00Z

So, recently I had to create a XML to JSON converter. It doesn't conform exactly to the JSON standard, but it comes pretty close. The xml2json function returns a dictionary representation of the xml object. All element attributes are included in a dictionary with a key of attributes and element text are included in the text key.

For example, your xml object would look like this after its conversion:

json = {'elements': 
    {'elem': [
        {'attributes': {'id', '1'}, 'text': 'some element'},
        {'attributes': {'id', '2'}, 'text': 'some other element'},
        {'attributes': {'id', '3'}, 'text': 'some element', 'nested': {
            'attributes': {'id', '1'}, 'text': 'other nested element'}},
    ]}

Here is the xml2json function.

def xml2json(x):
    def get_attributes(atts):
        atts = dict(atts)
        d = {}
        for k, v in atts.items():
            d[k] = v.value
        return d

    def get_children(n, d):
        tmp = {}
        d.setdefault(n.nodeName, {})
        if n.attributes:
            tmp['attributes'] = get_attributes(n.attributes)
        if n.hasChildNodes():
            for c in n.childNodes:
                if c.nodeType == c.TEXT_NODE or c.nodeName == '#cdata-section':
                    tmp['text'] = c.data
                else:
                    children = get_children(c, {})
                    for ck, cv in children.items():
                        if ck in d[n.nodeName]:
                            if not isinstance(d[n.nodeName][ck], list):
                                tmpv = d[n.nodeName][ck]
                                d[n.nodeName][ck] = []
                                d[n.nodeName][ck].append(tmpv)
                            d[n.nodeName][ck].append(cv)
                        else:
                            d[n.nodeName][ck] = cv

        for tk, tv in tmp.items():
            d[n.nodeName][tk] = tv

        return d

    return get_children(x.firstChild, {})

Here is the searchjson function.

def searchjson(sobj, reg):
    import re
    results = []
    if isinstance(sobj, basestring):
        # search the string and return the output
        if re.search(reg, sobj):
            results.append(sobj)
    else:
        # dictionary
        for k, v in sobj.items():
            newv = v
            if not isinstance(newv, list):
                newv = [newv]

            for elem in newv:
                has_attributes = False
                if isinstance(elem, dict):
                    has_attributes = bool(elem.get('attributes', False))
                res = searchjson(elem, reg)
                res = [] if not res else res
                for r in res:
                    r_is_dict = isinstance(r, dict)
                    r_no_attributes = r_is_dict and 'attributes' not in r.keys()
                    if has_attributes and r_no_attributes :
                        r.update({'attributes': elem.get('attributes', {})})

                    results.append({k: r})

    return results

The search function I created after reading your question. It hasn't been 100% tested and probably has a few bugs, but I think it would be a good start for you. As for what you're looking for, it searches nested elements, attributes, using wildcards. It also returns the id of the elements.

You can use the function like so, where xml is the xml object to search and reg is a regex pattern string to search for, ex: 'other', 'oth.*', '.the.' will all find the elements with "other" in them.

json = xml2json(xml)
results = searchjson(json, reg='other')

results will be a list of dictionaries.

Hope it helps.

I'm not familiar with lxml, if anyone else knows please let us know.

user177800 · Accepted Answer · 2011-04-28 19:51:13Z

0

For single large files accessed by a Python script, you should look at Xapian it is a full featured full text indexing and search engine, that is mature and robust and has wonderful first class Python bindings. It works with Python like it was written in Python, no external servers to run or any silliness like that.

If you don't need to persist the indexes, and can use the in memory database it will be extremely fast. It is faster than Lucene based solutions and uses a tiny fraction of the resources.

answered Apr 28, 2011 at 19:51

user177800

Collectives™ on Stack Overflow

Full text searching XML data with Python: best practices, pros & cons

6 Answers 6

2 Comments

6 Comments

2 Comments

Comments

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

2 Comments

6 Comments

2 Comments

Comments

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related