I have two databases with identical schemas and I want to effectively do a diff on one of the tables, i.e. return only the unique records, ignoring the primary key. Here's what I have so far:
import sqlite3

db1 = sqlite3.connect('/path/to/db1.db')
db2 = sqlite3.connect('/path/to/db2.db')

# Column names of foo; field 1 of each PRAGMA row is the column name.
columns = [row[1] for row in db1.execute("PRAGMA table_info(foo)")]

db1.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db1.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")
db2.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db2.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")

# Pair each db2 row with every db1 row that differs in ALL non-key
# columns (columns[1:] skips the primary key).
data = db2.execute("""
    SELECT one.*
    FROM db2.foo AS one
    JOIN db1.foo AS two
    WHERE {}
""".format(' AND '.join(['one.{0}!=two.{0}'.format(c) for c in columns[1:]]))
).fetchall()
That is, ignoring the primary key (in this case meow), don't return the records that exist identically in both databases.
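For reference, here's a minimal script that recreates the sample data shown below (the paths are the same placeholders as above, and I'm treating the blank deliver cells as NULLs):

import sqlite3

# Recreate the two sample databases (file names are placeholders).
samples = {
    '/path/to/db1.db': [(1, 123, 'abc', None),
                        (2, 234, 'bcd', 'two'),
                        (3, 345, 'cde', None)],
    '/path/to/db2.db': [(1, 345, 'cde', None),
                        (2, 123, 'abc', 'one'),
                        (3, 234, 'bcd', 'two'),
                        (4, 456, 'def', 'four')],
}
for path, rows in samples.items():
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE foo "
                 "(meow INTEGER PRIMARY KEY, mix INTEGER, please TEXT, deliver TEXT)")
    conn.executemany("INSERT INTO foo VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()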
The table foo in db1 looks like:
meow  mix  please  deliver
1     123  abc
2     234  bcd     two
3     345  cde
And the table foo in db2 looks like:
meow  mix  please  deliver
1     345  cde
2     123  abc     one
3     234  bcd     two
4     456  def     four
So the unique entries from db2 are:
[(2, 123, 'abc', 'one'), (4, 456, 'def', 'four')]
which is what I get. This works great when the table has more than two columns. But if there are only two, i.e. a primary key and a single value, as in a lookup table:
db1.foo:          db2.foo:

bar  baz          bar  baz
1    123          1    234
2    234          2    345
3    345          3    123
                  4    456
I get every non-unique value repeated N-1 times and every unique value repeated N times, where N is the number of records in db1. I understand why this happens: with only one non-key column, each db2 row joins to every db1 row whose baz differs, so a baz that also exists in db1 matches N-1 rows, while a baz that doesn't exist there matches all N. I just don't know how to fix it:
[(1, '234'),
(1, '234'),
(2, '345'),
(2, '345'),
(3, '123'),
(3, '123'),
(4, '456'),
(4, '456'),
(4, '456')]
One idea I had was to count the duplicates after pulling all the results and keep only the rows whose count is a multiple of N:

import itertools

# Rows that differ from ALL N rows in db1 come back N times;
# rows that also exist in db1 come back only N-1 times.
N = db1.execute("SELECT COUNT(*) FROM foo").fetchone()[0]
data = [
    list(row)
    for row, group in itertools.groupby(sorted(data))
    if len(list(group)) % N == 0
]
Which does work:
[[4, '456']]
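The same count-based filter can be written a bit more directly with collections.Counter (a sketch that assumes the default row factory, so the fetched rows are hashable tuples):

import collections

# Multiplicity of each returned row; keep rows that matched all N of
# db1's rows, i.e. whose count is a multiple of N.
counts = collections.Counter(data)
data = [list(row) for row, c in counts.items() if c % N == 0]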
Either way, this post-processing seems messy, and I'd like to do it all in that first SQL query if possible.
Also, on large tables (my real database has ~10k records) this takes a long time. Any way to optimize it?
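One direction I've wondered about, as an untested sketch, is replacing the self-join with a compound EXCEPT query (the caveat being that it returns only the non-key columns, so the primary key would have to be recovered separately):

# Sketch: let SQLite do the diff with EXCEPT instead of a self-join.
# Returns the non-key column combinations present in db2 but not db1;
# note the primary key (meow) is not in the result.
cols = ', '.join(columns[1:])
data = db2.execute(
    "SELECT {0} FROM db2.foo EXCEPT SELECT {0} FROM db1.foo".format(cols)
).fetchall()

If I understand EXCEPT correctly, it compares whole result rows (and treats NULLs as equal for duplicate elimination), so it would also sidestep the two-column problem entirely. Thanks!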