13

Task

I want to use Python for doing full text searches of XML data.

Example data

<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>

Basic functionality

The most basic functionality I want is that a search for "other" in an XPath ("/elements/elem") returns at least the value of the ID attribute for the matching element (elem 2) and nested element (elem 3, nested 1) or the matching XPaths.

Ideal functionality

The solution should be flexible and scalable. I am looking for possible combinations of these features:

  • search nested elements (infinite depth)
  • search attributes
  • search for sentences and paragraphs
  • search using wildcards
  • search using fuzzy matching
  • return precise matching info
  • good search speed for large XML files

Question

I don't expect a solution with all of the ideal functionality, I'll have to combine different existing functionalities and code the rest myself. But first I would like to know more about what there is out there, which libraries and approaches you would usually use for this, what their pros and cons are.

EDIT: Thanks for the answers so far, I added detail and started a bounty.

4
  • XQuery and XPath Full Text 1.0 Commented Apr 26, 2011 at 16:15
  • @Dimitre: Full Text has scoring, also. Commented Apr 27, 2011 at 20:18
  • Lxml really is the only way to go. Commented Apr 28, 2011 at 19:41
  • Good question, +1. See my answer for the two most powerful and standardized solutions. Commented Apr 28, 2011 at 19:41

6 Answers 6

6

Not sure if that will be enough for your needs, but lxml has support for regular expressions in xpath (meaning: you can use xpath 1.0 plus the EXSLT extension functions for regular expressions)

Compared to the feature list that was added later:

  • search nested elements (infinite depth): yes
  • search attributes: yes
  • search for sentences and paragraphs: no. Assuming that "paragraphs" are actual xml elements, then yes. But "sentences" as such, no.
  • search using wildcards: yes (regular expressions)
  • search using fuzzy matching: no (assuming stemming, synonyms and so on...)
  • return precise matching info: yes
  • good search speed for large XML files: yes, except when your files are so extremely large that you would actually need a fulltext index to get good speed anyway.

The only way to satisfy all your request that I see, would be to load your files into a native xml database that supports "real" fulltext search (via XQuery Fulltext probably) and use that. (can't help you much further with that, maybe Sedna, which seems to have a python API and seems to supports fulltext search?)

Sign up to request clarification or add additional context in comments.

2 Comments

@Jakob Bowyer: lxml is nice, but why is it the only way to go? How does it compare to options in the other answers?
@locodesportif: It doesn't get close to XPath 2.0, for example.
2

I think you would be best served using a full text search engine like Solr: http://lucene.apache.org/solr/

What you can do is store a "document" in solr for each <elem /> in your xml. You can store any data you like in the document. Then you can search against the index and grab the id field stored in the matching documents. This will be very fast for a large set of documents.

6 Comments

What's the overhead of Solr? Is there any Python integration?
Solr is an indexing and search server. As for overhead, it depends on what you mean by that. If you want to search a large amount of these documents, using a full text search engine is really your best option. If you just have a few, it might not be worth it. With Solr you have to run a server, specify the format of the documents, and send them to it to be indexed. There is python integration, but it's also a simple api that responds with JSON so you can use it from anywhere. I highly recommend it!
Thanks Max, this is something I'll keep in mind. But I need a more simple, easier to deploy, search solution for single, large XML files accessed by a Python script.
Solr is an extension of Lucence which is Java. Huge footprint and complexity for something as simple as these requirements.
I don't agree. His ideal requirements seem better suited for a full text search engine than a regex.
|
1
select="/elements/elem//[contains(.,'other')]"

see also xpath: find a node that has a given attribute whose value contains a string

2 Comments

What are the constraints of this? Can I search for text that includes linebreaks and use wildcards?
contains() is just substring search. If you need 'wildcards', use matches().
1

I'd recommend the following two:

Use XPath 2.0. It supports regular expressions.

Or,

Use XQuery and XPath (2.0) Full Text, which has even more powerful features.

Comments

1

So, recently I had to create a XML to JSON converter. It doesn't conform exactly to the JSON standard, but it comes pretty close. The xml2json function returns a dictionary representation of the xml object. All element attributes are included in a dictionary with a key of attributes and element text are included in the text key.

For example, your xml object would look like this after its conversion:

json = {'elements': 
    {'elem': [
        {'attributes': {'id', '1'}, 'text': 'some element'},
        {'attributes': {'id', '2'}, 'text': 'some other element'},
        {'attributes': {'id', '3'}, 'text': 'some element', 'nested': {
            'attributes': {'id', '1'}, 'text': 'other nested element'}},
    ]}

Here is the xml2json function.

def xml2json(x):
    def get_attributes(atts):
        atts = dict(atts)
        d = {}
        for k, v in atts.items():
            d[k] = v.value
        return d

    def get_children(n, d):
        tmp = {}
        d.setdefault(n.nodeName, {})
        if n.attributes:
            tmp['attributes'] = get_attributes(n.attributes)
        if n.hasChildNodes():
            for c in n.childNodes:
                if c.nodeType == c.TEXT_NODE or c.nodeName == '#cdata-section':
                    tmp['text'] = c.data
                else:
                    children = get_children(c, {})
                    for ck, cv in children.items():
                        if ck in d[n.nodeName]:
                            if not isinstance(d[n.nodeName][ck], list):
                                tmpv = d[n.nodeName][ck]
                                d[n.nodeName][ck] = []
                                d[n.nodeName][ck].append(tmpv)
                            d[n.nodeName][ck].append(cv)
                        else:
                            d[n.nodeName][ck] = cv

        for tk, tv in tmp.items():
            d[n.nodeName][tk] = tv

        return d

    return get_children(x.firstChild, {})

Here is the searchjson function.

def searchjson(sobj, reg):
    import re
    results = []
    if isinstance(sobj, basestring):
        # search the string and return the output
        if re.search(reg, sobj):
            results.append(sobj)
    else:
        # dictionary
        for k, v in sobj.items():
            newv = v
            if not isinstance(newv, list):
                newv = [newv]

            for elem in newv:
                has_attributes = False
                if isinstance(elem, dict):
                    has_attributes = bool(elem.get('attributes', False))
                res = searchjson(elem, reg)
                res = [] if not res else res
                for r in res:
                    r_is_dict = isinstance(r, dict)
                    r_no_attributes = r_is_dict and 'attributes' not in r.keys()
                    if has_attributes and r_no_attributes :
                        r.update({'attributes': elem.get('attributes', {})})

                    results.append({k: r})

    return results

The search function I created after reading your question. It hasn't been 100% tested and probably has a few bugs, but I think it would be a good start for you. As for what you're looking for, it searches nested elements, attributes, using wildcards. It also returns the id of the elements.

You can use the function like so, where xml is the xml object to search and reg is a regex pattern string to search for, ex: 'other', 'oth.*', '.the.' will all find the elements with "other" in them.

json = xml2json(xml)
results = searchjson(json, reg='other')

results will be a list of dictionaries.

Hope it helps.

2 Comments

Good idea. Can this be implemented using lxml, too? How?
I'm not familiar with lxml, if anyone else knows please let us know.
0

For single large files accessed by a Python script, you should look at Xapian it is a full featured full text indexing and search engine, that is mature and robust and has wonderful first class Python bindings. It works with Python like it was written in Python, no external servers to run or any silliness like that.

If you don't need to persist the indexes, and can use the in memory database it will be extremely fast. It is faster than Lucene based solutions and uses a tiny fraction of the resources.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.