
I have a large (long) table of data stored in SQLite, potentially 5 million+ rows. I am using the System.Data.SQLite package to execute my query and read the data into a bespoke in-memory collection structure in the regular ADO.NET way.

CODE (F#)

open System.Data.SQLite

// Record matching the four columns read below (field types inferred from the getters)
type Data = { X: int; Y: int; Z: int; AAA: float }

use cnxn = new SQLiteConnection(@"Data Source=C:\Temp\test.db;Version=3;Read Only=True;")
cnxn.Open()

let data = ResizeArray<Data>()

use cmd = new SQLiteCommand(@"SELECT X, Y, Z, AAA FROM Data", cnxn)
use reader = cmd.ExecuteReader()

// Materialise every row into the in-memory collection
while reader.Read() do
    let d = { X = reader.GetInt32(0); Y = reader.GetInt32(1)
              Z = reader.GetInt32(2); AAA = reader.GetDouble(3) }
    data.Add(d)

cnxn.Close()

Questions

  1. Is System.Data.SQLite the most performant library for the job here? I am only using it because it appears to be the standard choice.

  2. Is there a better way to code this up?

  3. Are there any settings/configurations on the database itself that would help this scenario?

Why do I think this should be able to go faster?

My computer's SSD has a theoretical read speed of 725 MB/s. Reading the SQLite table above, I read 40 MB in 1 s, which is an effective speed of only 40 MB/s.

Profiling also shows that about 35% of the time is spent in reader.Read() [not surprising] and the remainder in GetInt32() and GetDouble() [very surprising].
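
To double-check that split, one option is to time the two halves of the loop separately. This is only a rough sketch against the same reader, data and Data record as above; Stopwatch.Start/Stop adds its own per-iteration overhead, so the numbers are indicative only.

open System.Diagnostics

let readTimer = Stopwatch()
let getTimer = Stopwatch()

readTimer.Start()
let mutable more = reader.Read()
readTimer.Stop()

while more do
    getTimer.Start()
    // Same typed getters as the original loop
    let d = { X = reader.GetInt32(0); Y = reader.GetInt32(1)
              Z = reader.GetInt32(2); AAA = reader.GetDouble(3) }
    getTimer.Stop()
    data.Add(d)
    readTimer.Start()
    more <- reader.Read()
    readTimer.Stop()

printfn "Read(): %d ms, getters: %d ms" readTimer.ElapsedMilliseconds getTimer.ElapsedMilliseconds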

  • Why would you want to bring 5 million rows to memory all at once? Nothing can speed that up beyond a certain point. Might want to rethink that design. Commented Nov 5, 2015 at 8:24
  • @Hanky웃Panky Thanks, but I really do want to bring them ALL into memory and keep them there for an indefinite period, with full knowledge of why that may not appear to be good design. Commented Nov 5, 2015 at 8:28
  • Measure your loop: check whether the Read() or the Add() takes the most time. Commented Nov 5, 2015 at 8:31
  • @CL. From profiling: most time is taken in Read(), then in GetInt32() and then GetDouble(). Commented Nov 5, 2015 at 8:43
  • @Hanky웃Panky Fewer columns in an index result in fewer leaf pages stored on disc. Fewer pages means less disc I/O to load the table into memory. If the desired columns are 1/10 of the columns in the clustered index, then having an index covering only those columns could be a 90% reduction in data loaded from disc into memory and thus 1/10 the query time. Commented Nov 6, 2015 at 0:44
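
The covering-index suggestion in the last comment could be tried with something like the following. The index name is made up, the connection would have to be opened without Read Only=True to create it, and whether it actually speeds up the load depends on how wide the real table is, so it needs measuring.

// Hypothetical covering index over just the four queried columns, so SQLite can
// scan the (narrower) index instead of the full table. Benefit must be measured.
use idxCmd = new SQLiteCommand(@"CREATE INDEX IF NOT EXISTS idx_data_covering ON Data (X, Y, Z, AAA)", cnxn)
idxCmd.ExecuteNonQuery() |> ignore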
