
I have a list of words I need to search for (say a few thousand varchar2 entries of no more than 30 characters each), and I need to search for the presence of these words in sentences (say about a hundred million varchar2 entries of no more than 256 characters each). I would like to get the ID of each sentence with at least one matching word and, ideally, a list of indexes giving the positions of the searched words.

ID  searched words
1   pluto
2   jupiter

ID  sentences
1   we go back to earth
2   we discover pluto and jupiter

would give back the results:

minimum result
2

ideal result
2, ((1, 13), (2, 23))

While this is something I could develop myself, it feels like a common SQL request, so I wonder whether there are best practices for it, or better still, a dedicated function in Oracle SQL (19c onwards) or PL/SQL that would do it efficiently. It seems that Oracle Text CONTAINS with the ACCUM operator would work, but I am not sure I can use Oracle Text in my context, nor whether it would typically be slower or faster than a pure SQL/PLSQL approach.
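
For reference, a minimal Oracle Text sketch of the CONTAINS/ACCUM idea might look like the following, assuming Oracle Text is available and a CONTEXT index can be created; the index name and query label are illustrative:

-- Assumed one-time setup (illustrative index name):
-- CREATE INDEX sentences_txt_ix ON sentences (sentence) INDEXTYPE IS CTXSYS.CONTEXT;

-- ACCUM scores each row by how many of the listed terms it contains,
-- so any row returned matched at least one word.
SELECT id, SCORE(1) AS hits
  FROM sentences
 WHERE CONTAINS(sentence, 'pluto ACCUM jupiter', 1) > 0;

Note that CONTAINS only identifies matching rows; character positions would still have to come from INSTR or from Oracle Text's CTX_DOC document services, and a few thousand search words would likely need to be split across several query strings.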

1 Comment

Gotta admit I've never been a big fan of de-normalizing RDBMS data like this into one field. But without going into why... LISTAGG, INSTR, and a CROSS JOIN should do what you need, assuming you can handle the memory/processing requirements for that volume of records. Commented Feb 16, 2022 at 15:01

1 Answer

I make no claims to performance, and I wouldn't run this in a production environment until it's been vetted and the load/performance impacts considered.

  1. I use two CTEs to simulate your data (SearchWords and Sentences).
  2. I use INSTR() to find the position of each word in a sentence.
  3. I use LISTAGG() to combine the matches into one row per sentence.
  4. I only return occurrences where a word is found in a sentence.
  5. I use a CROSS JOIN so each search word is related to every sentence (this could get UGLY in terms of memory usage, CPU, etc., as the data set will be huge: thousands of words times hundreds of millions of sentences).

This is likely better done using text searches, but I'm not sure how I'd get the data format you are looking for that way... shrug. It may be acceptable if it's a one-time thing, you have the time to wait, and it's in an environment where you won't bring down production.

DEMO: https://dbfiddle.uk/?rdbms=oracle_21&fiddle=77e0b8d9373ee1abc14cf10342c45767

with SearchWords as (SELECT 1 ID, 'pluto' SearchWord from dual UNION ALL
                     SELECT 2, 'jupiter' from dual),
     Sentences as (SELECT 1 ID, 'we go back to earth' sentence from dual UNION ALL
                   SELECT 2, 'we discover pluto and jupiter' from dual),
     Step1 as (SELECT S.ID, LISTAGG('(' || W.ID || ',' || instr(S.Sentence, W.SearchWord) || ')', ',')
                      WITHIN GROUP (ORDER BY W.ID) Result
                 FROM Sentences S
                CROSS JOIN SearchWords W
                WHERE instr(S.Sentence, W.SearchWord) > 0
                GROUP BY S.ID)
SELECT * FROM Step1

You really don't need the Step1 CTE... but I wasn't sure if it was going to work out of the gate.

Giving us:

+----+---------------+
| ID |    RESULT     |
+----+---------------+
|  2 | (1,13),(2,23) |
+----+---------------+
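
As a side note, if only the "minimum result" (sentence IDs with at least one match) is needed, the aggregation can be skipped with a semi-join; here's a sketch reusing the same SearchWords/Sentences CTEs:

with SearchWords as (SELECT 1 ID, 'pluto' SearchWord from dual UNION ALL
                     SELECT 2, 'jupiter' from dual),
     Sentences as (SELECT 1 ID, 'we go back to earth' sentence from dual UNION ALL
                   SELECT 2, 'we discover pluto and jupiter' from dual)
-- EXISTS lets the optimizer stop probing a sentence after its first
-- matching word, and no GROUP BY / LISTAGG work is done.
SELECT S.ID
  FROM Sentences S
 WHERE EXISTS (SELECT 1
                 FROM SearchWords W
                WHERE instr(S.Sentence, W.SearchWord) > 0)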

If needed, you could subdivide the sentences into processing groups: process some, then union in more, etc., to manage the hit. But if your environment is sufficiently large, it may be able to handle it in one go.
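
A minimal PL/SQL sketch of that subdividing idea, assuming real tables sentences(id, sentence) and searchwords(id, searchword) plus a results table match_results(sentence_id, result), with roughly contiguous sentence IDs; all names and the batch size are illustrative:

DECLARE
  c_batch CONSTANT PLS_INTEGER := 1000000;  -- sentences per pass (tune to your environment)
  v_lo    NUMBER;
  v_hi    NUMBER;
BEGIN
  SELECT MIN(id), MAX(id) INTO v_lo, v_hi FROM sentences;

  WHILE v_lo <= v_hi LOOP
    INSERT INTO match_results (sentence_id, result)
    SELECT s.id,
           LISTAGG('(' || w.id || ',' || INSTR(s.sentence, w.searchword) || ')', ',')
             WITHIN GROUP (ORDER BY w.id)
      FROM sentences s
      CROSS JOIN searchwords w
     WHERE s.id BETWEEN v_lo AND v_lo + c_batch - 1
       AND INSTR(s.sentence, w.searchword) > 0
     GROUP BY s.id;

    COMMIT;  -- persist each batch so a failure doesn't lose earlier work
    v_lo := v_lo + c_batch;
  END LOOP;
END;
/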


3 Comments

Thank you for the detailed answer. This is what I was thinking of aiming for, and I share the fear that the cross join is going to bring everything to its knees, because unfortunately it will be a recurring task. I guess my next steps will be to measure that as the baseline, then be creative and find other options to compare against.
Maybe segment the sentences into smaller groups and loop, writing the results to a table. Process 1 million at a time, X times, or 10 million at a time, X times... At least the load would be reduced to the thousands of words times the 1-10 million sentences, giving you a chance to manage the scale/load.
I thought about unnesting each sentence, splitting it into one row per word, and setting that up as a materialized view/table. Then it just becomes an INNER JOIN to find matches, on which you could have an index. Performance-wise this would likely be faster, but 200 words * 100 million records didn't appeal to me as a data set. Though I'm still thinking that as a partitioned dataset it might be worth it; I might go this route if performance is horrid, and treat each starting letter of each word as its own table partition.
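
A rough sketch of that unnesting idea from the comment above, assuming the same sentences/searchwords tables as in the batching sketch; CROSS APPLY needs Oracle 12c+, the whitespace tokenizer is illustrative, and note this matches whole space-delimited words, unlike INSTR, which also matches substrings:

-- Unnest each sentence into one row per word, keeping the character position.
-- Could also be a plain table built with CREATE TABLE ... AS SELECT.
CREATE MATERIALIZED VIEW sentence_words AS
SELECT s.id AS sentence_id, t.word, t.pos
  FROM sentences s
  CROSS APPLY (
        SELECT REGEXP_SUBSTR(s.sentence, '[^ ]+', 1, LEVEL) AS word,
               REGEXP_INSTR (s.sentence, '[^ ]+', 1, LEVEL) AS pos
          FROM dual
        CONNECT BY LEVEL <= REGEXP_COUNT(s.sentence, '[^ ]+')
       ) t;

CREATE INDEX sentence_words_ix ON sentence_words (word);

-- Matching then becomes an indexed inner join instead of a cross join.
SELECT sw.sentence_id, w.id AS word_id, sw.pos
  FROM sentence_words sw
  JOIN searchwords w
    ON w.searchword = sw.word;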
