
I wrote a Python script using SQLAlchemy to fetch and update the last activity of my active users.

But the number of users is growing rapidly, and now I'm getting the following error:

psycopg2.ProgrammingError: Statement is too large. Statement Size: 16840277 bytes. Maximum Allowed: 16777216 bytes

I thought updating postgresql.conf might fix it, so I tuned it with the help of pgtune, but that didn't work. I then updated the kernel settings in /etc/sysctl.conf with the following parameters:

kern.sysv.shmmax=4194304
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8
kern.sysv.shmall=1024

and again it didn't work.

After that I divided my query into slices to reduce the size, but I got the same error.

How can I find out which parameter I need to change to increase the maximum statement size?

Workflow

query = "SELECT id FROM {}.{} WHERE status=TRUE".format(schema, customer_table)
ids = [str(i) for i in pd.read_sql(query, insert_uri).id.tolist()]

read_query = """
SELECT id,
 MAX(CONVERT_TIMEZONE('America/Mexico_City', last_activity)) lastactivity
FROM activity WHERE
DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', last_activity)) =
DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', CURRENT_DATE))-{} and
 id in ({})
GROUP BY id
""".format(day, ",".join(ids))

last_activity = pd.read_sql(read_query, read_engine, parse_dates=True)
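As a sketch of the slicing idea mentioned above (the helper names `chunked` and `build_in_clause` and the chunk size are illustrative, not from the original code), the id list could be split into batches and bound through driver placeholders instead of being formatted into the SQL string:

```python
def chunked(seq, size):
    """Yield successive slices of `seq` of at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def build_in_clause(ids):
    """Build a parameterized IN clause: a placeholder string plus the
    parameter tuple, to be passed to the driver separately."""
    placeholders = ", ".join(["%s"] * len(ids))
    return "id IN ({})".format(placeholders), tuple(ids)


ids = list(range(1, 8))
for chunk in chunked(ids, 3):
    clause, params = build_in_clause(chunk)
    # clause is e.g. "id IN (%s, %s, %s)" and params e.g. (1, 2, 3);
    # both would be passed to pd.read_sql / cursor.execute, keeping each
    # statement small and letting the driver handle quoting.
```

Each chunk yields a statement far below the 16 MiB limit, though the subquery approach in the answer below avoids shipping the ids at all.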
  • Do you really need a statement 16840277 bytes long? Commented Mar 12, 2016 at 7:46
  • Indeed, I'd have thought this limit is like 167772 % larger than the maximum statement that we'd conceivably need in our application that does some really badass analytics. Where's the SQLAlchemy code? Commented Mar 12, 2016 at 7:51
  • Yes, this workflow considerably reduced the time of my process; isn't that normal? I have ~800k users, but some of them are inactive, so first I have to determine which users are active, and by computing only the active users with this workflow I can reduce the time. Commented Mar 12, 2016 at 7:59
  • @AnttiHaapala I updated my workflow; I'm using pandas to read the database and return a DataFrame for further transformations. Commented Mar 12, 2016 at 8:13
  • You must not format your parameters into the query; you must use placeholders instead. Commented Mar 12, 2016 at 8:17

1 Answer

If you are only fetching the IDs from the database and not filtering them in any other way, there is no need to fetch them at all; you can embed the first SQL statement as a subquery in the second:

SELECT id,
 MAX(CONVERT_TIMEZONE('America/Mexico_City', last_activity)) lastactivity
FROM activity WHERE
 DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', last_activity)) =
 DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', CURRENT_DATE))-%s and
 id in (
    SELECT id FROM customerschema.customer WHERE status=TRUE
 )
GROUP BY id

Also, as Antti Haapala said, don't use string formatting for SQL parameters: it is insecure, and if any parameter contains suitably placed quotes, Postgres will interpret it as commands instead of data.
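A minimal sketch of the combined, parameterized call (assuming a psycopg2-backed engine, where pandas forwards `%(name)s`-style parameters to the driver; `read_engine` and `day` are the variables from the question's workflow):

```python
# The whole query is a constant string: the day offset is a bound
# parameter, and the active ids come from a subquery, so the statement
# stays small no matter how many users exist.
read_query = """
SELECT id,
       MAX(CONVERT_TIMEZONE('America/Mexico_City', last_activity)) AS lastactivity
FROM activity
WHERE DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', last_activity)) =
      DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', CURRENT_DATE)) - %(day)s
  AND id IN (SELECT id FROM customerschema.customer WHERE status = TRUE)
GROUP BY id
"""

# The driver binds the value; nothing is formatted into the SQL text:
# last_activity = pd.read_sql(read_query, read_engine, params={"day": day})
```

Because the statement no longer contains ~800k literal ids, it stays a few hundred bytes regardless of the user count.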


1 Comment

That's the way to do it. You have a database, so you should use its power.
