Python file indexing and searching

Question

I have a large set of files (HDF) that I need to make searchable. In Java I would use Lucene for this, as it's a file and document indexing engine. I don't know what the Python equivalent would be, though.

Can anyone recommend a library for indexing a large collection of files for fast search? Or is the preferred way to roll your own?

I have looked at PyLucene and Lupy, but both projects seem rather inactive and unsupported, so I am not sure if I should rely on them.

Final notes: Whoosh and PyLucene seem promising, but Whoosh is still alpha, so I am not sure I want to rely on it, and I have problems compiling PyLucene, which has no actual releases. After looking a bit more at the data, I found it's mostly numbers and default text strings, so as of now a full-text indexing engine won't help me. Hopefully these libraries will stabilize and later visitors will find some use for them.
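For data that is mostly numbers and short strings, an ordinary database index may be enough. The following is only a minimal sketch using the standard-library sqlite3 module; the table name, column names, and sample records are hypothetical stand-ins for whatever is extracted from the HDF files, not something taken from the original question.

```python
import sqlite3

# Hypothetical records standing in for values extracted from the HDF files.
records = [
    ("run_001.h5", "temperature", 21.5),
    ("run_002.h5", "temperature", 35.0),
    ("run_003.h5", "pressure", 101.3),
]

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the index
conn.execute("CREATE TABLE measurements (filename TEXT, field TEXT, value REAL)")
# A composite index on (field, value) can serve field + range lookups
# without scanning the whole table.
conn.execute("CREATE INDEX idx_field_value ON measurements (field, value)")
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", records)
conn.commit()


def files_in_range(conn, field, low, high):
    """Return filenames whose value for `field` lies in [low, high], sorted by name."""
    rows = conn.execute(
        "SELECT DISTINCT filename FROM measurements "
        "WHERE field = ? AND value BETWEEN ? AND ? ORDER BY filename",
        (field, low, high),
    )
    return [row[0] for row in rows]
```

Calling files_in_range(conn, "temperature", 20, 30) then returns the files whose temperature value falls in that range. Whether this scales to the real data set depends on how many values each file contributes.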

I used Whoosh in 2018 and it was solid :) Used it to index several thousand resumes.
– Kim, Commented Sep 2, 2022 at 2:14

A. Coady · Accepted Answer · 2009-02-10 18:51:31Z

9

Lupy has been retired and the developers recommend PyLucene instead. As for PyLucene, its mailing list activity may be low, but it is definitely supported. In fact, it just recently became an official Apache subproject.

You may also want to look at a new contender: Whoosh. It's similar to Lucene, but implemented in pure Python.
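To make the idea concrete, here is a toy sketch of the inverted index that engines like Lucene and Whoosh are built around. This is not Whoosh's or PyLucene's API; the class, function names, and sample documents are made up for illustration, and real engines add ranking, on-disk storage, and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy in-memory index mapping each token to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Record every token of `text` as occurring in `doc_id`."""
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every token in the query (AND semantics)."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        # Copy the first posting set so intersecting does not mutate the index.
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


# Hypothetical documents: descriptions extracted from three files.
index = InvertedIndex()
index.add("run_001.h5", "temperature sensor calibration run")
index.add("run_002.h5", "pressure sensor baseline")
index.add("run_003.h5", "temperature baseline")
```

A query such as index.search("sensor baseline") returns only the ids of documents that contain both words. Rolling your own like this is fine for small collections held in memory; a library becomes worthwhile once you need persistence, relevance scoring, or phrase queries.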

batbrat · Answer · 2009-02-10 13:42:54Z

5

I haven't done indexing before, but the following may be helpful:

pyIndex - http://rgaucher.info/beta/pyIndex/ -- File indexing library for Python
http://www.xml.com/pub/a/ws/2003/05/13/email.html -- That's a script for searching Outlook email using Python and Lucene
http://gadfly.sourceforge.net/ - Aaron Watters's Gadfly database (I think you can use this one for indexing; I haven't used it myself.)

As far as using HDF files goes, I have heard of a module called h5py.

I hope this helps.

2 Comments

Staale Over a year ago

I can read the hdf5 files fine using pytables, I just need to find the right tool to index the information I extract.

batbrat Over a year ago

I have little experience in the area. Since you can already read HDF5 files, I think pyIndex might work for you. I hope your project works out well.

Seb · Answer · 2009-02-10 13:57:01Z

4

I'd suggest Sphinx. It's very active, has many more features, and seems faster than Lucene.

1 Comment

Gregg Lind Over a year ago

Sphinx is great and, IMHO, easier to install, configure, etc. than PyLucene.

Rob Young · Answer · 2009-04-20 21:08:59Z

2

A popular C++-based information retrieval library that is often used with Python is Xapian: http://xapian.org/

It's incredibly quick and can happily manage large amounts of data; however, it's not quite as easily extensible as Lucene.

Saurabh · Answer · 2019-04-06 06:52:51Z

0

Elasticsearch can be used to index documents and search them by keyword. It can also be integrated with graph databases and Hadoop. Some URLs:
1) https://www.elastic.co/products/elasticsearch
2) https://towardsdatascience.com/getting-started-with-elasticsearch-in-python-c3598e718380
