My problem is as follows: let's say I have three files, A, B, and C. Each file contains 100-150M strings (one per line), and each string is a hierarchical path like /e/d/f. For example:
File A (RTL):
/arbiter/par0/unit1/sigA
/arbiter/par0/unit1/sigB
...
/arbiter/par0/unit2/sigA
File B (SCH):
/arbiter_sch/par0/unit1/sigA
/arbiter_sch/par0/unit1/sigB
...
/arbiter_sch/par0/unit2/sigA
File C (Layout):
/top/arbiter/par0/unit1/sigA
/top/arbiter/par0/unit1/sigB
...
/top/arbiter/par0/unit2/sigA
Think of file A as the circuit signals in a hardware modeling language (RTL), file B as the same signals in a schematic netlist, and file C as the same signals in a layout (for manufacturing).
Each signal has a mapping between file A <-> file B <-> file C. In this example, /arbiter/par0/unit1/sigA == /arbiter_sch/par0/unit1/sigA == /top/arbiter/par0/unit1/sigA. Of course, this association (equivalence) is established by me; I don't expect the matcher to figure it out.
Now say I query with '/arbiter/par0/unit1/sigA'. The matcher should return a direct (exact) match from file A, since that string exists there. For files B and C an exact match is not possible, so it should return the closest candidates (by edit distance?), i.e., /arbiter_sch/par0/unit1/sigA from file B and /top/arbiter/par0/unit1/sigA from file C.
Instead of a full-string query, I could also give a pattern like *par0*unit1*sigA, and it should return all possible matches from files A/B/C.
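From what I have read of the Lucene docs so far, these three kinds of lookups seem to map onto TermQuery, WildcardQuery/RegexpQuery, and FuzzyQuery. Below is a minimal, untested sketch of what I have in mind; I am assuming one index per file and a single untokenized field that I am calling `path` (the field name is just my choice):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

public class QuerySketch {
    // Direct match: the full path was indexed as one untokenized term.
    static Query exact(String path) {
        return new TermQuery(new Term("path", path));
    }

    // Pattern match, e.g. "*par0*unit1*sigA". A leading wildcard forces
    // Lucene to scan the whole term dictionary, so this may be the slow case.
    static Query wildcard(String pattern) {
        return new WildcardQuery(new Term("path", pattern));
    }

    // Regex match over the same field, e.g. ".*par0.*unit1.*sigA".
    static Query regex(String regexPattern) {
        return new RegexpQuery(new Term("path", regexPattern));
    }

    // Closest-string match. FuzzyQuery caps the edit distance at 2, which is
    // probably too small to bridge /arbiter/... and /top/arbiter/... as whole
    // strings, so I may need to fuzzy-match per path segment instead.
    static Query fuzzy(String path) {
        return new FuzzyQuery(new Term("path", path), 2);
    }
}
```

Is this roughly the right way to express these searches, or is there a better field/query design for paths this long?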
While looking for solutions I came across Apache Lucene, but I am not totally sure it would work for this. I am going through the docs to get an idea.
My main requirements are the following:
There will be 3 text files with full signal paths. (I can adjust the format to make it more compact if that helps build the index more quickly.)
Building the index should be fairly fast (a couple of hours at most). The files above are static (no modifications after indexing).
Searching should be comprehensive. ~1 s per search is OK, but the matching should support direct match, regex/wildcard match, and edit-distance matching. The main challenge is that each file can contain 100-150 million signals.
Can someone tell me whether such a use case can be reasonably addressed by Lucene? What would be the correct way to go about building an index and doing fast searching? I would like to write some proof-of-concept code and test the performance; the indexing sketch I was planning to start from is below. Thanks.
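This is only a rough starting point (assuming Lucene 9.x, a KeywordAnalyzer, and StringField so each full path is indexed as a single untokenized term; the file names, the `path` field, and the 512 MB RAM buffer are just my guesses):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BuildSignalIndex {
    // Index one signal list (one path per line) into its own index directory.
    static void buildIndex(Path signalFile, Path indexDir) throws IOException {
        IndexWriterConfig cfg = new IndexWriterConfig(new KeywordAnalyzer());
        cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        cfg.setRAMBufferSizeMB(512); // bigger buffer => fewer segment flushes

        try (FSDirectory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(dir, cfg);
             BufferedReader reader = Files.newBufferedReader(signalFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Document doc = new Document();
                // StringField keeps the whole path as one term (no tokenizing)
                // and stores it so the original string comes back in results.
                doc.add(new StringField("path", line, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // One index per source file, e.g. the RTL list:
        buildIndex(Paths.get("rtl_signals.txt"), Paths.get("index_rtl"));
    }
}
```

Would this approach scale to 150M documents per index, and is there anything I should change up front (field type, analyzer, index layout) before running the full experiment?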