My problem is as follows: let's say I have three files, A, B, and C. Each file contains 100-150M strings (one per line), and each string is a hierarchical path like /e/d/f. For example:
File A (RTL):
/arbiter/par0/unit1/sigA
/arbiter/par0/unit1/sigB
...
/arbiter/par0/unit2/sigA
File B (SCH):
/arbiter_sch/par0/unit1/sigA
/arbiter_sch/par0/unit1/sigB
...
/arbiter_sch/par0/unit2/sigA
File C (Layout):
/top/arbiter/par0/unit1/sigA
/top/arbiter/par0/unit1/sigB
...
/top/arbiter/par0/unit2/sigA
Think of file A as the circuit signals in a hardware modeling language (RTL), file B as the same signals in a schematic netlist, and file C as the same signals in a layout (for manufacturing).
Each signal has a mapping between file A <-> file B <-> file C. In this example, /arbiter/par0/unit1/sigA == /arbiter_sch/par0/unit1/sigA == /top/arbiter/par0/unit1/sigA. Of course, this association (equivalence) is established by me; I don't expect the matcher to figure it out.
Now say I query with '/arbiter/par0/unit1/sigA'. The matcher should return a direct (exact) match from file A, since that string exists there. For files B and C an exact match is not possible, so it should return the closest candidates (by edit distance?), i.e., /arbiter_sch/par0/unit1/sigA from file B and /top/arbiter/par0/unit1/sigA from file C.
Instead of a full-string query, I could also give a pattern like *par0*unit1*sigA, and it should return all possible matches from files A/B/C.
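From what I have read of the Lucene docs so far, these three kinds of lookups seem to map onto TermQuery, WildcardQuery/RegexpQuery, and FuzzyQuery. Below is a minimal, untested sketch of what I have in mind; I am assuming one index per file and a single untokenized field that I am calling `path` (the field name is just my choice):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

public class QuerySketch {
    // Direct match: the full path was indexed as one untokenized term.
    static Query exact(String path) {
        return new TermQuery(new Term("path", path));
    }

    // Pattern match, e.g. "*par0*unit1*sigA". A leading wildcard forces
    // Lucene to scan the whole term dictionary, so this may be the slow case.
    static Query wildcard(String pattern) {
        return new WildcardQuery(new Term("path", pattern));
    }

    // Regex match over the same field, e.g. ".*par0.*unit1.*sigA".
    static Query regex(String regexPattern) {
        return new RegexpQuery(new Term("path", regexPattern));
    }

    // Closest-string match. FuzzyQuery caps the edit distance at 2, which is
    // probably too small to bridge /arbiter/... and /top/arbiter/... as whole
    // strings, so I may need to fuzzy-match per path segment instead.
    static Query fuzzy(String path) {
        return new FuzzyQuery(new Term("path", path), 2);
    }
}
```

Is this roughly the right way to express these searches, or is there a better field/query design for paths this long?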
While looking for solutions I came across Apache Lucene, but I am not totally sure it would work for this. I am going through the docs to get an idea.
My main requirements are the following:
There will be 3 text files with full signal paths. (I can adjust the format to make it more compact if that helps build the index more quickly.)
Building the index should be fairly fast (a couple of hours at most). The files above are static (no modifications after indexing).
Searching should be comprehensive. ~1 s per search is OK, but the matching should support direct match, regex/wildcard match, and edit-distance matching. The main challenge is that each file can contain 100-150 million signals.
Can someone tell me whether such a use case can be reasonably addressed by Lucene? What would be the correct way to go about building an index and doing fast searching? I would like to write some proof-of-concept code and test the performance; the indexing sketch I was planning to start from is below. Thanks.
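This is only a rough starting point (assuming Lucene 9.x, a KeywordAnalyzer, and StringField so each full path is indexed as a single untokenized term; the file names, the `path` field, and the 512 MB RAM buffer are just my guesses):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BuildSignalIndex {
    // Index one signal list (one path per line) into its own index directory.
    static void buildIndex(Path signalFile, Path indexDir) throws IOException {
        IndexWriterConfig cfg = new IndexWriterConfig(new KeywordAnalyzer());
        cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        cfg.setRAMBufferSizeMB(512); // bigger buffer => fewer segment flushes

        try (FSDirectory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(dir, cfg);
             BufferedReader reader = Files.newBufferedReader(signalFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Document doc = new Document();
                // StringField keeps the whole path as one term (no tokenizing)
                // and stores it so the original string comes back in results.
                doc.add(new StringField("path", line, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // One index per source file, e.g. the RTL list:
        buildIndex(Paths.get("rtl_signals.txt"), Paths.get("index_rtl"));
    }
}
```

Would this approach scale to 150M documents per index, and is there anything I should change up front (field type, analyzer, index layout) before running the full experiment?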