8

I have 6 tables in my SQLite database, each table with 6 columns(Date, user, NormalA, specialA, contact, remarks) and 1000+ rows.

How can I use sqlalchemy to sort through the Date column to look for duplicate dates, and delete that row?

1
  • 2
    Q1: do you have also a separate primary key column? Q2: Why is the fact that you have 6 tables important for this question? Commented Mar 20, 2016 at 18:45

3 Answers 3

3

Assuming this is your model:

class MyTable(Base):
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    date = Column(DateTime)
    user = Column(String)
    # do not really care of columns other than `id` and `date`
    # important here is the fact that `id` is a PK

following are two ways to delete you data:

  1. Find duplicates, mark them for deletion and commit the transaction
  2. Create a single SQL query which will perform deletion on the database directly.

For both of them a helper sub-query will be used:

# helper subquery: find first row (by primary key) for each unique date
subq = (
    session.query(MyTable.date, func.min(MyTable.id).label("min_id"))
    .group_by(MyTable.date)
) .subquery('date_min_id')

Option-1: Find duplicates, mark them for deletion and commit the transaction

# query to find all duplicates
q_duplicates = (
    session
    .query(MyTable)
    .join(subq, and_(
        MyTable.date == subq.c.date,
        MyTable.id != subq.c.min_id)
    )
)

for x in q_duplicates:
    print("Will delete %s" % x)
    session.delete(x)
session.commit()

Option-2: Create a single SQL query which will perform deletion on the database directly

sq = (
    session
    .query(MyTable.id)
    .join(subq, and_(
        MyTable.date == subq.c.date,
        MyTable.id != subq.c.min_id)
    )
).subquery("subq")

dq = (
    session
    .query(MyTable)
    .filter(MyTable.id.in_(sq))
).delete(synchronize_session=False)
Sign up to request clarification or add additional context in comments.

2 Comments

What is 'Base' or what does it stand for in the Class? This code is not runnable without defining it.
@Apostolos, please see the Declarative Mapping part of documentation. Base is the base class for mapped classes/tables.
1

Inspired by the Find duplicate values in SQL table this might help you to select duplicate dates:

query = session.query(
    MyTable
).\
    having(func.count(MyTable.date) > 1).\
    group_by(MyTable.date).all()

If you only want to show unique dates; distinct on is what you might need

2 Comments

Cant do a func.count. There doesn't seem to be that choice
@jakewong Did you import the function: from sqlalchemy import func?
1

While I like the whole object oriented approache with SQLAlchemy, sometimes I find it easier to directly use some SQL. And since the records don't have a key, we need the row number (_ROWID_) to delete the targeted records and I don't think the API provides it.

So first we connect to the database:

from sqlalchemy import create_engine
db = create_engine(r'sqlite:///C:\temp\example.db')
eng = db.engine

Then to list all the records:

for row in eng.execute("SELECT * FROM TableA;") :
  print row

And to display all the duplicated records where the dates are identical:

for row in eng.execute("""
  SELECT * FROM {table}
  WHERE {field} IN (SELECT {field} FROM {table} GROUP BY {field} HAVING COUNT(*) > 1)
  ORDER BY {field};
  """.format(table="TableA", field="Date")) :
  print row

Now that we identified all the duplicates, they probably need to be fixed if the other fields are different:

eng.execute("UPDATE TableA SET NormalA=18, specialA=20 WHERE Date = '2016-18-12' ;");
eng.execute("UPDATE TableA SET NormalA=4,  specialA=8  WHERE Date = '2015-18-12' ;");

And finnally to keep the first inserted record and delete the most recent duplicated records :

print eng.execute("""
  DELETE FROM {table} 
  WHERE _ROWID_ NOT IN (SELECT MIN(_ROWID_) FROM {table} GROUP BY {field});
  """.format(table="TableA", field="Date")).rowcount

Or to keep the last inserted record and delete the other duplicated records :

print eng.execute("""
  DELETE FROM {table} 
  WHERE _ROWID_ NOT IN (SELECT MAX(_ROWID_) FROM {table} GROUP BY {field});
  """.format(table="TableA", field="Date")).rowcount

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.