
I need to extract a large amount of data (>1GB) from a database to a CSV file. I'm using this script:

rs_cursor = rs_db.cursor()
rs_cursor.execute("""SELECT %(sql_fields)s
                     FROM table1""" % {"sql_fields": sql_fields})
sqlData = rs_cursor.fetchall()
rs_cursor.close()

c = csv.writer(open(filename, "wb"))
c.writerow(headers)
for row in sqlData:
    c.writerow(row)

The problem is that while writing the file, the system runs out of memory. Is there another, more efficient way to create such a large CSV file?

  • The problem most probably is with sqlData, not the fact that you write this data to a file. Where does this data come from? Do you have any control over it? If you do, you should be looking into reading it in chunks or as a generator. Commented Aug 10, 2016 at 15:09
  • How are you getting the SQL data? Can you show us that code? Commented Aug 10, 2016 at 15:11
  • I added the bit of code that builds sqlData. The data is coming from a massive table. Commented Aug 10, 2016 at 15:13
  • What database/library are you using? In pymssql you can use fetchmany with the size argument so it doesn't return the whole table at once, see its docs. You can also consider using WHERE in order to SELECT from the table in chunks (a sketch of that idea follows these comments). Commented Aug 10, 2016 at 15:16
  • Thanks DeepSpace, I'm using psycopg2 (redshift). In that case, how can I write the file without overwriting it if I'm reading by chunks? Commented Aug 10, 2016 at 15:20
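
For reference, here is a minimal sketch of the "SELECT in chunks" idea from the comments, combined with opening the file once so that each chunk is appended rather than overwriting earlier ones. It assumes table1 has a unique, sortable column (called id here purely for illustration) and reuses rs_db, sql_fields, filename and headers from the question:

import csv

chunk_size = 10000  # rows per round trip; tune for your memory budget
last_id = 0

# "wb" matches the Python 2 style used elsewhere in this thread;
# on Python 3 use open(filename, "w", newline="") instead.
with open(filename, "wb") as f:
    c = csv.writer(f)
    c.writerow(headers)
    cur = rs_db.cursor()
    while True:
        # %%(...)s survives the string formatting as %(...)s,
        # which psycopg2 then fills in from the second argument.
        cur.execute(
            """SELECT id, %(sql_fields)s
               FROM table1
               WHERE id > %%(last_id)s
               ORDER BY id
               LIMIT %%(limit)s""" % {"sql_fields": sql_fields},
            {"last_id": last_id, "limit": chunk_size})
        rows = cur.fetchall()
        if not rows:
            break
        last_id = rows[-1][0]                 # highest id in this chunk
        c.writerows(row[1:] for row in rows)  # drop the helper id column
    cur.close()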

2 Answers


psycopg2 (which OP uses) has a fetchmany method that accepts a size argument. Use it to read a fixed number of rows at a time from the database. You can experiment with the value of n to balance run time against memory usage.

fetchmany docs: http://initd.org/psycopg/docs/cursor.html#cursor.fetchmany

rs_cursor = rs_db.cursor()
rs_cursor.execute("""SELECT %(sql_fields)s
                     FROM table1""" % {"sql_fields": sql_fields})
c = csv.writer(open(filename, "wb"))
c.writerow(headers)

n = 100
sqlData = rs_cursor.fetchmany(n)

while sqlData:
    for row in sqlData:
        c.writerow(row)
    sqlData = rs_cursor.fetchmany(n)

rs_cursor.close()


You can also wrap this with a generator to simplify the code a little bit:

def get_n_rows_from_table(n):
    rs_cursor = rs_db.cursor()
    rs_cursor.execute("""SELECT %(sql_fields)s
                             FROM table1""" % {"sql_fields": sql_fields})
    sqlData = rs_cursor.fetchmany(n)

    while sqlData:
        yield sqlData
        sqlData = rs_cursor.fetchmany(n)
    rs_cursor.close()

c = csv.writer(open(filename, "wb"))
c.writerow(headers)

# each item yielded by the generator is a chunk (list of rows), so use writerows
for row_chunk in get_n_rows_from_table(100):
    c.writerows(row_chunk)

Have you tried fetchone()?

rs_cursor = rs_db.cursor()
rs_cursor.execute("""SELECT %(sql_fields)s
                     FROM table1""" % {"sql_fields": sql_fields})

c = csv.writer(open(filename, "wb"))
c.writerow(headers)
row = rs_cursor.fetchone()
while row:
    c.writerow(row)
    row = rs_cursor.fetchone()

rs_cursor.close()

2 Comments

While this approach will work, it can be very slow as database I/O tends to be a slow process.
You can see my answer for another approach using fetchmany.
