0

It has been a while since my sql days thus I am wondering whether it is possible to further optimize the following query? The goal is to collect all accounts for each accountant including all the bookings and movements associated with it. Performance/query-time is very important since there are '3 digit'-million datasets...

select Accountant.person_id, 
    Account.account_id, 
    Account.number, 
    Account.balance,
    Account_Type.type_number, 
    Booking.booking_id, 
    Booking.amount, 
    Movement.movement_date, 
    Movement.movement_desc
from Accountant
join Account on Accountant.person_id = Account.person_id
join Account_Type on Account.account_type_id = Account_Type.account_type_id
left outer join Booking on Account.account_id = Booking.account_id
left outer join Movement on Booking.movement_id = Movement.movement_id;

The entity model looks something like that: enter image description here

UPDATE: Since some of you are wondering: Yes I am simply selecting hundreds of million of rows since the query is used to migrate data. The data queried is used to construct a new data structure which is put in another database...

  1. To allow returning accounts with no Bookings I added the outer joins. Is that the right syntax?
  2. Here is the explain plan - seeing any optimization possibilities? enter image description here
  3. After adding some missing indices the query takes (in an external tool) about 1/2 hour. In java a memory error is thrown at some point. Any hints (except increasing memory) how to optimize that?
5
  • 3
    How about using a proper JOIN? Commented Mar 18, 2014 at 15:04
  • 1
    for staters, it (probably) will be obvious to you that you missed something Commented Mar 18, 2014 at 15:09
  • Is correct cartesian product? Because Booking and Movement are not linkd to Accountant, account, account_type Commented Mar 18, 2014 at 15:10
  • 1
    But booking isn't linked to account in your where clause. Your diagram shows four relationships, so only having three in the where clause is a bit of a red flag. It would be much more obvious with ANSI joins. (With no actual filters, just join conditions, hopefully you're expecting a large result set; sounds like you expect all data from all five tables; not sure how useful that is!) Commented Mar 18, 2014 at 15:17
  • We need a lot more information before we can help you optimize the query. What is the real query (I assume you will not simply select hundreds of millions of rows?), how long does it take, etc. Commented Mar 18, 2014 at 17:24

2 Answers 2

3

As others have already mentioned: It is strange, you want to select hundreds of millions of records in one go.

Aside from that:

  1. The left outer join on table Booking will only work, if you also outer join table Movement.
  2. As you want all records, only full table scans (or fast full index scans for that matter) make sense. Check the explain plan if this is the case. (It should.) Otherwise use /*+full(tablename)*/.
  3. As you will use full table scans you may want to have them run in parallel. Check the explain plan if this is already the case. Otherwise use /*+parallel(tablename,factor)*/.
  4. In case the tables have many columns, it might be good to have indexes containing the desired columns, so fast full index scans instead of full table scans can be used and less disc blocks need to be read thus.
  5. You can reduce disc reads by compressing tables (Oracle 11g and up).
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks for the tip with the explain plan - I added it above. Do you see any optimization possibilities?
The explain plan looks fine to me. Only full table and fast full index scans, except for account_type which can be treated as a small lookup list. Moreover hash joins are being used, which seams the appropriate method for big data. Now you could use the parallel hint to speed things up in case you have several CPUs available. However, you added the remark that the statement runs for half an hour now and you need the complete data to move it to another system. So why bother at all if it's half an hour or not? You need it just once.
I tried the following: select /*+ parallel(8) */ but the query needed longer. Now I want to try select /*+ parallel(auto) */. Any hints on how I can specify the 'parallel' statement?
This should do it: select /*+parallel(account,8)*/ . You could add other tables to that parallel list, but as account is the central table the parallel reading should be automativcally applied to the other tables anyhow. It's contagious ;-)
2

The join syntax is clearer and recommended. I don't think you can optimise the query itself any further, but you could check the DB schema, or investigate alternatives.

DB Schema:

  • are all IDs marked primary (i.e.indexed and clustered)?
  • can you check the query plans to ensure the query is being executed optimally?

You could also consider a reporting table that is built at the end of every day (or more frequently if required) - and query from that. Oracle supports materialised views that might help.

Finally, a query that returns n,000,000 records really isn't much use to anyone, so you might reconsider your logic. If you're presenting these records to a user, you need to page them (so limit the query to a page size) - or if you're presenting summary info, rewrite the query to use aggregate functions (sum/max/avg etc.) Don't use code to do what a DB does best.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.