
I have two SQL statements that I expected to perform similarly, but SQL 1 takes 0.065 seconds while SQL 2 takes over 10 seconds, on a table with only 8000 records in total. Could anyone help explain this? How can I optimize SQL 2?

SQL 1:

select
    job_id,
    JOB_DESCRIPTION,
    REGEXP_COUNT(JOB_Description, '(ABC|DEF)([[:digit:]]){5}') as occurrences 
from smms.job 
where TO_NUMBER(to_char(CREATE_DATE,'YYYY')) = 2017;

SQL 2:

select job_id, JOB_Description 
from (
    select 
        job_id, 
        JOB_DESCRIPTION,
        REGEXP_COUNT(JOB_Description, '(ABC|DEF)([[:digit:]]){5}') as occurrences 
    from smms.job 
    where TO_NUMBER(to_char(CREATE_DATE,'YYYY')) = 2017
) 
where occurrences > 0;
  • Although this is somewhat separate from your question, why does "SQL 2" use the subquery? occurrences is not used in the final select list, so I don't see why you don't just use REGEXP_COUNT directly in the WHERE clause, i.e. SELECT job_id, job_description FROM smms.job WHERE TO_NUMBER... =2017 AND REGEXP_COUNT... > 0 Commented Aug 27, 2018 at 4:10
  • What are the execution plans of the queries? What happens if TO_NUMBER(to_char(CREATE_DATE,'YYYY'))=2017 is changed to CREATE_DATE >= DATE'2017-01-01' AND CREATE_DATE < DATE'2018-01-01'? Commented Aug 27, 2018 at 4:15
  • The clause you mentioned was my first version, but it also takes over 10 seconds. That's why I tried a different form of the query to optimize it. Commented Aug 27, 2018 at 4:26
  • I guess that version 1 is able to restrict the rather expensive regexp operation to a smaller resultset - the execution plans for both versions should show the difference (or maybe a sql trace with event 10046). Is there an index on create_date? Commented Aug 27, 2018 at 6:00
  • SQL3: select count(*) from smms.job where REGEXP_COUNT(JOB_Description, '(ABC|DEF)([[:digit:]]){5}')>0 takes 10 seconds as well, which means the performance has nothing to do with the CREATE_DATE filter; it depends entirely on the REGEXP_COUNT clause. Commented Aug 27, 2018 at 6:13

3 Answers


Thinking again about the information given, I guess the two execution strategies are:

SQL 1:

  • Filter the rows with TO_NUMBER(to_char(CREATE_DATE,'YYYY')) = 2017
  • Apply the function REGEXP_COUNT(JOB_Description, '(ABC|DEF)([[:digit:]]){5}') to the resulting rows only

SQL 2:

  • Apply the function REGEXP_COUNT(JOB_Description, '(ABC|DEF)([[:digit:]]){5}') to all rows
  • Filter the result with TO_NUMBER(to_char(CREATE_DATE,'YYYY')) = 2017

Since regexp functions are very expensive in Oracle, evaluating REGEXP_COUNT on every row of the table instead of only the 2017 rows could explain the difference in performance.

SQL 2 could be optimized with hints, for example by moving the date filter into a CTE and adding the MATERIALIZE hint to it.
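A minimal sketch of that idea, assuming the table and columns from the question. The CTE name jobs_2017 is made up for illustration, and MATERIALIZE is not an officially documented hint, so its behaviour may vary between Oracle versions. The intent is that the date filter is resolved first and REGEXP_COUNT only runs on the rows that survive it:

```sql
-- Sketch only: jobs_2017 is a hypothetical CTE name.
-- The MATERIALIZE hint asks Oracle to evaluate the CTE once and store
-- its rows, so the regexp predicate below is applied to that smaller set.
WITH jobs_2017 AS (
    SELECT /*+ MATERIALIZE */
           job_id,
           job_description
    FROM   smms.job
    WHERE  TO_NUMBER(TO_CHAR(create_date, 'YYYY')) = 2017
)
SELECT job_id,
       job_description
FROM   jobs_2017
WHERE  REGEXP_COUNT(job_description, '(ABC|DEF)([[:digit:]]){5}') > 0;
```

Comparing the execution plan with and without the hint should show whether the regexp predicate is applied to the materialized rows or pushed back down into the scan of smms.job.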


1 Comment

Yes, you are correct! If I first apply TO_NUMBER(to_char(CREATE_DATE,'YYYY')) = 2017 in a CTE with /*+ materialize */ and then apply the REGEXP_COUNT, performance improves greatly. Without the materialize hint, no matter how I reorganize the filter clauses, the execution plan and processing time stay the same. Thank you so much!

As Martin pointed out, the issue is the expensive regexp_count function. So the question reduces to:

Why is:

  select * from (
  with dat as (select level lv, rpad('X',500,'X') txt from dual connect by level <= 20000)
  select lv, 
         REGEXP_COUNT(txt, '(ABC|DEF)([[:digit:]]){5}') as occurrences 
  from   dat 
  --where  REGEXP_COUNT(txt, '(ABC|DEF)([[:digit:]]){5}') > 1
  ) where rownum > 1

0.019 seconds and

  select * from (
  with dat as (select level lv, rpad('X',500,'X') txt from dual connect by level <= 20000)
  select lv, 
         REGEXP_COUNT(txt, '(ABC|DEF)([[:digit:]]){5}') as occurrences 
  from   dat 
  where  REGEXP_COUNT(txt, '(ABC|DEF)([[:digit:]]){5}') > 1
  ) where rownum > 1

6.7 seconds? Oracle evaluates regexp_count in both executions, so there must be a difference between how it is evaluated in the WHERE clause and how it is evaluated in the SELECT list.

2 Comments

I don't quite understand your point, but rownum > 1 can materialize the SQL, thus improving the performance.
rownum > 1 is not there to improve performance. It is there to measure how long a statement takes without having to fetch to the end of the cursor. So queries 1 and 2 do the same thing, yet query 2 is slower by a large factor.

In SQL1, it first filters by (TO_NUMBER(to_char(CREATE_DATE,'YYYY')) = 2017), then executes (REGEXP_COUNT) once per returned row.

In SQL2, it filters by the result of (REGEXP_COUNT), which means it executes the function against all table rows. Then it filters that result by (TO_NUMBER(to_char(CREATE_DATE,'YYYY')) = 2017).

To prove this, execute SQL1 without the filter. It will take approximately as much time as SQL2, maybe a little more.
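As a sketch, that check is simply SQL1 from the question with its WHERE clause removed, so REGEXP_COUNT has to run for every row in the table:

```sql
-- SQL1 without the year filter: the regexp is evaluated for all rows.
SELECT job_id,
       job_description,
       REGEXP_COUNT(job_description, '(ABC|DEF)([[:digit:]]){5}') AS occurrences
FROM   smms.job;
```

When timing it, fetch all rows, since some client tools only fetch the first page of results and would under-report the cost.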

To optimize, you need to be 100% sure the SQL1 filter is applied first. A sure way would be to execute SQL1 and store the results in a temporary/in-memory table, then apply the SQL2 filter to those results.
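One way this could look, as a rough sketch: the staging table name job_2017_tmp is hypothetical, and a global temporary table is only one of several options for holding the intermediate rows.

```sql
-- Sketch only: job_2017_tmp is a hypothetical staging table.
-- Step 1: store the rows that pass the cheap date filter.
-- ON COMMIT PRESERVE ROWS keeps them for the rest of the session.
CREATE GLOBAL TEMPORARY TABLE job_2017_tmp
ON COMMIT PRESERVE ROWS
AS
SELECT job_id,
       job_description
FROM   smms.job
WHERE  TO_NUMBER(TO_CHAR(create_date, 'YYYY')) = 2017;

-- Step 2: run the expensive regexp only against the staged rows.
SELECT job_id,
       job_description
FROM   job_2017_tmp
WHERE  REGEXP_COUNT(job_description, '(ABC|DEF)([[:digit:]]){5}') > 0;
```

Because the two steps are separate statements, the optimizer cannot merge the regexp predicate back into the scan of smms.job, which is the guarantee this answer is after.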
