
I have a list of words I need to search for (say a few thousand varchar2 entries of no more than 30 characters each), and I need to search for the presence of these words in sentences (say about a hundred million varchar2 entries of no more than 256 characters each). I would like to get the ID of each sentence with at least one matching word and, ideally, a list of indexes giving the positions of the searched words.

ID  searched words
1   pluto
2   jupiter

ID  sentences
1   we go back to earth
2   we discover pluto and jupiter

would give back the results:

minimum result
2

ideal result
2, ((1, 13), (2, 23))

While this is something I could develop myself, it feels like a common SQL request, so I wonder whether there are best practices for it, or better still, a dedicated function in Oracle SQL (19c onwards) or PL/SQL that would do it efficiently. It seems that Oracle Text CONTAINS with the ACCUM operator would work, but I am not sure I can use Oracle Text in my context, nor whether it would typically be slower or faster than a pure SQL/PLSQL approach.
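
For reference, a minimal Oracle Text sketch of the CONTAINS/ACCUM idea might look like the following, assuming Oracle Text is available and a CONTEXT index can be created; the index name and query label are illustrative:

-- Assumed one-time setup (illustrative index name):
-- CREATE INDEX sentences_txt_ix ON sentences (sentence) INDEXTYPE IS CTXSYS.CONTEXT;

-- ACCUM scores each row by how many of the listed terms it contains,
-- so any row returned matched at least one word.
SELECT id, SCORE(1) AS hits
  FROM sentences
 WHERE CONTAINS(sentence, 'pluto ACCUM jupiter', 1) > 0;

Note that CONTAINS only identifies matching rows; character positions would still have to come from INSTR or from Oracle Text's CTX_DOC document services, and a few thousand search words would likely need to be split across several query strings.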

1 Comment

Gotta admit I've never been a big fan of de-normalizing RDBMS data like this into one field. But without going into why... LISTAGG, INSTR, and a CROSS JOIN should do what you need, assuming you can handle the memory/processing requirements for that volume of records. Commented Feb 16, 2022 at 15:01

1 Answer

I make no claims to performance, and I wouldn't run this in a production environment until it's been vetted and the load/performance impacts considered.

  1. I use two CTEs to simulate your data (SearchWords and Sentences).
  2. I use INSTR() to find the position of each word in a sentence.
  3. I use LISTAGG() to combine the matches into one row per sentence.
  4. I only return occurrences where a word is found in a sentence.
  5. I use a CROSS JOIN so each search word is related to every sentence (this could get UGLY in terms of memory usage, CPU, etc., as the data set will be huge: thousands of words times hundreds of millions of sentences).

This is likely better done using text searches, but I'm not sure how I'd get the data format you are looking for that way... shrug. It may be acceptable if it's a one-time thing, you have the time to wait, and it's in an environment where you won't bring down production.

DEMO: https://dbfiddle.uk/?rdbms=oracle_21&fiddle=77e0b8d9373ee1abc14cf10342c45767

with SearchWords as (SELECT 1 ID, 'pluto' SearchWord from dual UNION ALL
                     SELECT 2, 'jupiter' from dual),
     Sentences as (SELECT 1 ID, 'we go back to earth' sentence from dual UNION ALL
                   SELECT 2, 'we discover pluto and jupiter' from dual),
     Step1 as (SELECT S.ID, LISTAGG('(' || W.ID || ',' || instr(S.Sentence, W.SearchWord) || ')', ',')
                      WITHIN GROUP (ORDER BY W.ID) Result
                 FROM Sentences S
                CROSS JOIN SearchWords W
                WHERE instr(S.Sentence, W.SearchWord) > 0
                GROUP BY S.ID)
SELECT * FROM Step1

You really don't need the Step1 CTE... but I wasn't sure if it was going to work out of the gate.

Giving us:

+----+---------------+
| ID |    RESULT     |
+----+---------------+
|  2 | (1,13),(2,23) |
+----+---------------+
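
As a side note, if only the "minimum result" (sentence IDs with at least one match) is needed, the aggregation can be skipped with a semi-join; here's a sketch reusing the same SearchWords/Sentences CTEs:

with SearchWords as (SELECT 1 ID, 'pluto' SearchWord from dual UNION ALL
                     SELECT 2, 'jupiter' from dual),
     Sentences as (SELECT 1 ID, 'we go back to earth' sentence from dual UNION ALL
                   SELECT 2, 'we discover pluto and jupiter' from dual)
-- EXISTS lets the optimizer stop probing a sentence after its first
-- matching word, and no GROUP BY / LISTAGG work is done.
SELECT S.ID
  FROM Sentences S
 WHERE EXISTS (SELECT 1
                 FROM SearchWords W
                WHERE instr(S.Sentence, W.SearchWord) > 0)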

If needed, you could subdivide the sentences into processing groups: process some, then union in more, etc., to manage the hit. But if your environment is sufficiently large, it may be able to handle it in one go.
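
A minimal PL/SQL sketch of that subdividing idea, assuming real tables sentences(id, sentence) and searchwords(id, searchword) plus a results table match_results(sentence_id, result), with roughly contiguous sentence IDs; all names and the batch size are illustrative:

DECLARE
  c_batch CONSTANT PLS_INTEGER := 1000000;  -- sentences per pass (tune to your environment)
  v_lo    NUMBER;
  v_hi    NUMBER;
BEGIN
  SELECT MIN(id), MAX(id) INTO v_lo, v_hi FROM sentences;

  WHILE v_lo <= v_hi LOOP
    INSERT INTO match_results (sentence_id, result)
    SELECT s.id,
           LISTAGG('(' || w.id || ',' || INSTR(s.sentence, w.searchword) || ')', ',')
             WITHIN GROUP (ORDER BY w.id)
      FROM sentences s
      CROSS JOIN searchwords w
     WHERE s.id BETWEEN v_lo AND v_lo + c_batch - 1
       AND INSTR(s.sentence, w.searchword) > 0
     GROUP BY s.id;

    COMMIT;  -- persist each batch so a failure doesn't lose earlier work
    v_lo := v_lo + c_batch;
  END LOOP;
END;
/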


3 Comments

Thank you for the detailed answer. This is what I was thinking of aiming for, and I share the fear that the cross join is going to bring everything to its knees, because unfortunately it will be a recurring task. I guess my next steps will be to measure that as the baseline, then be creative and find other options to compare against.
Maybe segment the sentences into smaller groups and loop, writing the results to a table. Process 1 million at a time, X times, or 10 million at a time, X times... At least the load would be reduced to the thousands of words times the 1-10 million sentences, giving you a chance to manage the scale/load.
I thought about unnesting each sentence, splitting it into one row per word, and setting that up as a materialized view/table. Then it just becomes an INNER JOIN to find matches, on which you could have an index. Performance-wise this would likely be faster, but 200 words * 100 million records didn't appeal to me as a data set. Though I'm still thinking that as a partitioned dataset it might be worth it; I might go this route if performance is horrid, and treat each starting letter of each word as its own table partition.
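
A rough sketch of that unnesting idea from the comment above, assuming the same sentences/searchwords tables as in the batching sketch; CROSS APPLY needs Oracle 12c+, the whitespace tokenizer is illustrative, and note this matches whole space-delimited words, unlike INSTR, which also matches substrings:

-- Unnest each sentence into one row per word, keeping the character position.
-- Could also be a plain table built with CREATE TABLE ... AS SELECT.
CREATE MATERIALIZED VIEW sentence_words AS
SELECT s.id AS sentence_id, t.word, t.pos
  FROM sentences s
  CROSS APPLY (
        SELECT REGEXP_SUBSTR(s.sentence, '[^ ]+', 1, LEVEL) AS word,
               REGEXP_INSTR (s.sentence, '[^ ]+', 1, LEVEL) AS pos
          FROM dual
        CONNECT BY LEVEL <= REGEXP_COUNT(s.sentence, '[^ ]+')
       ) t;

CREATE INDEX sentence_words_ix ON sentence_words (word);

-- Matching then becomes an indexed inner join instead of a cross join.
SELECT sw.sentence_id, w.id AS word_id, sw.pos
  FROM sentence_words sw
  JOIN searchwords w
    ON w.searchword = sw.word;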
