I have two databases with identical schemas and I want to effectively do a diff on one of the tables, i.e. return only the unique records, ignoring the primary key.

import sqlite3

# One connection per database file
db1 = sqlite3.connect('/path/to/db1.db')
db2 = sqlite3.connect('/path/to/db2.db')

# Column names are the second field of each PRAGMA table_info row
columns = list(zip(*db1.execute("PRAGMA table_info(foo)").fetchall()))[1]

# Attach both files to each connection so one query can see both tables
db1.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db1.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")
db2.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db2.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")

# Cross join: keep db2 rows that differ from a db1 row in every
# non-PK column (columns[1:] skips the primary key)
data = db2.execute("""
    SELECT two.*
    FROM db1.foo AS one
    JOIN db2.foo AS two
    WHERE {}
    """.format(' AND '.join('one.{0}!=two.{0}'.format(c) for c in columns[1:]))
).fetchall()

That is, ignoring the primary key (in this case meow), don't return the records that exist identically in both databases.

The table foo in db1 looks like:

meow    mix    please   deliver
1       123    abc
2       234    bcd      two
3       345    cde

And the table foo in db2 looks like:

meow    mix    please   deliver
1       345    cde
2       123    abc      one
3       234    bcd      two     
4       456    def      four

So the unique entries from db2 are:

[(2, 123, 'abc', 'one'), (4, 456, 'def', 'four')]

which is what I get. This works great when there are more than two columns. But if there are only two, i.e. a primary key and a single value, as in a lookup table:

db1.foo          db2.foo
bar  baz         bar   baz
1    123         1     234
2    234         2     345
3    345         3     123
                 4     456

I get all non-unique values repeated N-1 times and unique values repeated N times, where N is the number of records in db1. I understand why this is happening (with a single value column, the WHERE clause reduces to one inequality, so each db2 row is paired with every db1 row whose value differs) but I don't know how to fix it.

[(1, '234'),
 (1, '234'),
 (2, '345'),
 (2, '345'),
 (3, '123'),
 (3, '123'),
 (4, '456'),
 (4, '456'),
 (4, '456')]

One idea I had was to just take the modulus after pulling all the duplicate results:

import itertools

N = db1.execute("SELECT Count(*) FROM foo").fetchone()[0]
# A row unique to db2 differs from all N db1 rows, so it appears
# exactly N times in the join output; duplicates appear N-1 times
data = [
    list(row)
    for row, group in itertools.groupby(sorted(data))
    if len(list(group)) % N == 0
]

Which does work:

[[4, '456']]

But this seems messy and I'd like to do it all in that first SQL query if possible.

Also, on large tables (my real db has ~10k records) this takes a long time. Any way to optimize this? Thanks!

1 Answer

Replacing my earlier answer -- here is a good general solution.

Having input tables that look like this:

sqlite> select * from t1;
meow        mix         please      delivery  
----------  ----------  ----------  ----------
1           123         abc                   
2           234         bcd         two       
3           345         cde                   

and

sqlite> select * from t2;
meow        mix         please      delivery  
----------  ----------  ----------  ----------
1           345         cde                   
2           123         abc         one       
3           234         bcd         two       
4           456         def         four      

You can get records that are in t2 / not in t1 (ignoring PKs) like this:

select sum(q1.db), mix, please, delivery
from (select 1 as db, mix, please, delivery from t1
      union all
      select 2 as db, mix, please, delivery from t2) q1
group by mix, please, delivery
having sum(db) = 2;

sum(q1.db)  mix         please      delivery  
----------  ----------  ----------  ----------
2           123         abc         one       
2           456         def         four      
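
For reference, here is a sketch of how this query could be run from the question's Python setup -- assuming the db2 connection and the columns tuple from the question are still in scope, the two files are attached as db1 and db2, and columns[0] is the primary key:

value_cols = ', '.join(columns[1:])   # e.g. "mix, please, deliver"

query = """
    SELECT SUM(q1.db), {cols}
    FROM (SELECT 1 AS db, {cols} FROM db1.foo
          UNION ALL
          SELECT 2 AS db, {cols} FROM db2.foo) q1
    GROUP BY {cols}
    HAVING SUM(db) = 2
""".format(cols=value_cols)

# Each table is scanned once, so this also avoids the cross join's
# quadratic blowup on the ~10k-row tables mentioned in the question.
data = db2.execute(query).fetchall()
# -> [(2, 123, 'abc', 'one'), (2, 456, 'def', 'four')]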

You can do different set operations by changing the value in the HAVING clause: SUM(db)=1 returns records in t1 / not in t2; SUM(db)=2 returns records in t2 / not in t1; SUM(db)=1 OR SUM(db)=2 returns records that exist in either table but not both; and SUM(db)=3 returns records that exist in both. Note that this assumes each distinct combination of values appears at most once per table -- a row duplicated within one table would skew the sums.

The only thing this doesn't do for you is return the PK. This can't be done in the query I've written because the GROUP BY and SUM operations only work on common / aggregated data, and the PK fields are by definition unique. If you know the combination of non-PK fields is unique within each DB, you could use the returned records to create a new query to find the PK as well.
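
If the combination is unique, that follow-up lookup might be sketched like this (again reusing the db2 connection and columns tuple from the question, and using IS rather than = so that NULL values compare as equal):

# Recover the db2 primary key for each row returned by the diff query;
# assumes the non-PK combination is unique, so fetchone() finds one row
where = ' AND '.join('{0} IS ?'.format(c) for c in columns[1:])
pk_query = "SELECT {pk} FROM db2.foo WHERE {where}".format(
    pk=columns[0], where=where)

full_rows = []
for row in data:              # rows from the HAVING query above
    values = row[1:]          # drop the leading SUM(db) column
    pk = db2.execute(pk_query, values).fetchone()[0]
    full_rows.append((pk,) + tuple(values))
# -> [(2, 123, 'abc', 'one'), (4, 456, 'def', 'four')]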

Note this approach extends nicely to more than 2 tables. By making the db field a power of 2, you can operate on any number of tables. E.g. if you did 1 as db for t1, 2 as db for t2, 4 as db for t3, 8 as db for t4, you could find any intersection / difference of the tables you want by changing the having condition -- e.g. HAVING SUM(DB)=5 would return records that are in t1 and t3 but not in t2 or t4.
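
As a sketch of that generalization -- building the tagged UNION ALL for four hypothetical tables t1 through t4, then selecting rows present in t1 and t3 only:

# Tag each table with a power of 2 (1, 2, 4, 8), then filter on the sum
tables = ['t1', 't2', 't3', 't4']          # illustrative names
cols = 'mix, please, delivery'

union = '\nUNION ALL\n'.join(
    'SELECT {tag} AS db, {cols} FROM {t}'.format(tag=1 << i, cols=cols, t=t)
    for i, t in enumerate(tables)
)
query = """
    SELECT SUM(q1.db), {cols}
    FROM ({union}) q1
    GROUP BY {cols}
    HAVING SUM(db) = 5        -- 1 + 4: in t1 and t3, not in t2 or t4
""".format(cols=cols, union=union)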

Comments

It grabs them correctly but doesn't pull out the unique ones. Should the HAVING condition be different? I get [(1, '123', 'abc', None), (1, '234', 'bcd', 'two'), (1, '345', 'cde', None), (2, '123', 'abc', 'one'), (2, '234', 'bcd', 'two'), (2, '345', 'cde', None), (2, '456', 'def', 'four')] as a result.
Again, this should return [(2, 123, 'abc', 'one'), (4, 456, 'def', 'four')].
Joe, from your sample data, shouldn't this return three records? (1, 123, 'abc', ''), (2, 123, 'abc', 'one') and (4, 456, 'def', 'four')? The two records that have mix=123 aren't identical -- the delivery value is different. (This affects the revised answer I'm about to send you.) In Python terms, is this t2 - t1 (set difference) or t2 ^ t1 (symmetric difference)?
Oh, maybe I wasn't clear. Sorry! I want just the records in db2.foo that don't appear in db1.foo, not all the unique records in both dbs (a true diff).
OK, got it -- stand by --