
Context

I'm working with several files in a proprietary format that store the results of a power system solution. The data is formatted fairly simply, but each result file is ~50MB. There is an API provided to query the file format, but I need to do lots of queries, and the API is horrendously slow.

I wrote a program to compare several of these files to each other using the API, and left it running for a couple of hours to no avail. My next thought was to do a single pass over the file, store the data I need into a sqlite3 database, and then query that. That got me a result in 20 minutes. Much better. Restructured the data to avoid JOINs where possible: 12 minutes. Stored the .db file in a temporary local location instead of on the network: 8.5 minutes.
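
For reference, the load step is roughly the sketch below. parse_rows() is a hypothetical stand-in for the single pass over the proprietary file, and the table/column names are the simplified placeholders used later in this post (results instead of the reserved word table):

import os
import sqlite3
import tempfile

def build_local_db(result_file, db_name):
    # Keep the .db in a local temporary location instead of on the network.
    db_path = os.path.join(tempfile.gettempdir(), db_name)
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE results (solution INTEGER, num INTEGER, value REAL)")
    # parse_rows() is hypothetical: a generator doing one pass over the
    # proprietary file and yielding (solution, num, value) tuples.
    con.executemany("INSERT INTO results VALUES (?, ?, ?)", parse_rows(result_file))
    con.commit()
    return con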

Further Improvement

The program is more or less tolerable at its current speed, but it will be run many, many times per day once it's completed. At the moment, 62% of the run time is spent on 721 calls each of .execute and .fetchone.

      160787763 function calls (160787745 primitive calls) in 503.061 seconds
Ordered by: internal time
List reduced from 1507 to 20 due to restriction <20>
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   721  182.869    0.254  182.869    0.254 {method 'fetchone' of 'sqlite3.Cursor' objects}
   721  129.355    0.179  129.355    0.179 {method 'execute' of 'sqlite3.Cursor' objects}
 24822   45.734    0.002   47.600    0.002 {method 'executemany' of 'sqlite3.Connection' objects}
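
The listing above is standard cProfile/pstats output; a run can be profiled with something like the following (compare() is a placeholder for the program's real entry point):

import cProfile
import pstats

cProfile.run('compare()', 'compare.prof')      # compare() is a placeholder
stats = pstats.Stats('compare.prof')
stats.sort_stats('tottime').print_stats(20)    # "internal time", top 20 entries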

Since so much time is spent in this small section, I thought I would ask for ideas to improve it before moving forward. I feel like I may be missing something simple that a more experienced eye will catch. This particular part of the program is basically structured like this:

for i, db in enumerate(dbs):
    for key, vals in dict.iteritems():
        # If it already has a value, no need to get a comparison value
        if not vals[i]:
            solution_id = key[0]
            num = key[1]

            # Only get a comparison value if the solution is valid for the current db
            if solution_id in db.valid_ids:
                db.cur.execute("""SELECT value FROM table WHERE solution == ? AND num == ?""",
                               (solution_id, num))
                try:
                    vals[i] = db.cur.fetchone()[0]
                # .fetchone() could have returned None, no __getitem__
                except TypeError:
                    pass

The dict structure is:

dict = {(solution_id, num): [db1_val, db2_val, db3_val, db4_val]}

Every entry has at least one db_val; the others are None. The purpose of the loop above is to fill every db_val spot that can be filled, so the values can be compared.

The Question

I've read that sqlite3 SELECT statements can only be executed with .execute, so that removes my ability to use .executemany (which saved me tons of time on INSERTs). I've also read in the Python docs that calling .execute directly on the connection object can be more efficient, but I can't do that since I need to fetch the data.

Is there a better way to structure the loop, or the query, to minimize the amount of time spent on .execute and .fetchone statements?

The Answer

Based on the answers provided by CL and rocksportrocker, I changed my table create statement (simplified version) from:

CREATE TABLE table(
    solution integer, num integer, ..., value real,
    FOREIGN KEY (solution) REFERENCES solution (id),
    FOREIGN KEY (num) REFERENCES nums (id)
);

to:

CREATE TABLE table(
    solution integer, num integer, ..., value real,
    PRIMARY KEY (solution, num),
    FOREIGN KEY (solution) REFERENCES solution (id),
    FOREIGN KEY (num) REFERENCES nums (id)
) WITHOUT ROWID;
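
A quick way to confirm the new schema is actually used for the lookups (results stands in for the real table name here, and cur is a cursor on the rebuilt database):

cur.execute("EXPLAIN QUERY PLAN "
            "SELECT value FROM results WHERE solution = ? AND num = ?", (1, 1))
print(cur.fetchall())
# With PRIMARY KEY (solution, num) and WITHOUT ROWID, the plan should show a
# SEARCH using the primary key instead of a full table SCAN.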

In my test case,

  • File sizes remained the same
  • The .executemany INSERT statements increased from ~46 to ~69 seconds
  • The .execute SELECT statements decreased from ~129 to ~5 seconds
  • The .fetchone statements decreased from ~183 to ~0 seconds
  • Total time reduced from ~503 seconds to ~228 seconds, 45% of the original time

Any other improvements are still welcomed, hopefully this can become a good reference question for others who are new to SQL.

  • Have you tried doing a simple regex search (either with Python or with the Linux command grep just for testing purposes)? Great first question, by the way, and welcome! Commented Sep 28, 2017 at 12:57
  • Are you suggesting to use regexs and avoid creating the .db altogether? If so, I haven't for three reasons: 1) I'm leaving the option to keep the .db after the comparison so the full results are available to look at; 2) I have lots of linked data in different tables that I am manipulating in SQL, finally creating four views which summarize the data I need to start the comparison; and 3) I don't know how to use regexs. :) Thanks for the welcome! Commented Sep 28, 2017 at 13:14
  • Note that we don't put answers inside the question on this site. If you have a hybrid solution based on previous answers, it is completely acceptable to post an answer to your own question to indicate how you solved it! (Also note that if you give us the file format, we can help with regex on this site as well; it is one of the tags.) Commented Sep 28, 2017 at 19:36

2 Answers


The execute() and fetchone() calls are where the database does all its work.

To speed up the query, the lookup columns must be indexed. To save space, you can use a clustered index, i.e., make the table a WITHOUT ROWID table.


1 Comment

Created a PRIMARY KEY (solution, num) on my table, and added WITHOUT ROWID. This kept my file size exactly where it was, increased write time from ~46 to ~70 seconds, but reduced total time from ~500 to ~230 seconds. Huge improvement. I didn't know about the WITHOUT ROWID option, thanks so much!

Did you consider introducing an index on the solution column? It would increase insertion time and the size of the .db file.
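
For example (results stands in for the table name from the question's simplified schema, cur for the cursor used while building the database):

# Index the column used in the WHERE clause of the SELECT.
cur.execute("CREATE INDEX results_solution ON results (solution)")
# Or, as refined in the comments below, cover both lookup columns:
cur.execute("CREATE INDEX results_solution_num ON results (solution, num)")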

5 Comments

I eliminated most indices due to file size. The sqlite3 db is coming out at roughly 5x the size of the proprietary file (which I'm guessing is compressed). So, with just 4 files I'm creating ~1GB of databases. This tool will regularly be used for 12-15 files at a time, and size becomes an issue.
I gave this a shot just to test it out. Instead of ~250MB dbs, I ended up with ~350MB dbs. My primary inserts went from ~46 to ~96 seconds. But, my total time went from ~503 to ~260 seconds. I don't love the extra 100MB, but this is definitely a good option. I could write with the index, compare, delete the dbs, and then rewrite them without the index in less time than my current solution. Thanks so much, I had no idea the impact would be that large. I'll have to read up on indexing, I am fairly inexperienced in SQL.
You might also consider replacing the full loop for key, vals in dict.iteritems(): with a SELECT that returns one row per iteration. How large would the result of a plain "SELECT value FROM table" be? I ask because a single SELECT returning many values can be much faster than iterating over many SELECTs. (A sketch of this idea follows these comments.)
I looked at your code again; you'd be better off with a combined index on solution and num instead of an index on solution alone.
Well, the result of a single SELECT would be small, but the table has ~300k rows in it, and I'm only interested in 721 of them in this case. You and @CL were right about indexing the table. The WITHOUT ROWID option solved the increased file size issue too.
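
For completeness, a sketch of the single-SELECT idea from the comment above, assuming the simplified placeholder names (results for the table, comparisons for the dict keyed by (solution_id, num)):

for i, db in enumerate(dbs):
    # One query per database instead of one .execute/.fetchone per missing value.
    db.cur.execute("SELECT solution, num, value FROM results")
    lookup = {(solution, num): value
              for solution, num, value in db.cur.fetchall()}
    for key, vals in comparisons.items():
        if not vals[i] and key[0] in db.valid_ids and key in lookup:
            vals[i] = lookup[key]

Whether this beats the indexed point lookups depends on the table size; with ~300k rows per table and only 721 keys of interest here, pulling the full table may not pay off.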
