I am currently using SQLAlchemy to write a pandas DataFrame to a PostgreSQL database on an AWS server. My code looks like this:

import pickle
from sqlalchemy import create_engine

engine = create_engine(
    'postgresql://{}:{}@{}:{}/{}'.format(ModelData.user, ModelData.password, ModelData.host,
                                         ModelData.port, ModelData.database), echo=True)

# Load the pickled DataFrame and write it to the database in 1,000-row chunks
with open(file, 'rb') as f:
    df = pickle.load(f)
df.to_sql(table_name, engine, method='multi', if_exists='replace', index=False, chunksize=1000)

The table I am writing has about 900,000 rows and 500 columns, and the write takes a very long time to complete; sometimes I wait all day and it still isn't finished. Is there a faster way to write this data? To reiterate, this question is about speed, not about whether the code executes. Any help would be appreciated!

Note: The machine I'm using has 32 GB of RAM, an i7 processor, 1 TB of storage, and a GPU, so I don't think the bottleneck is my machine.

1 Answer

Have you played with the chunksize parameter?
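For reference, a minimal sketch of what that tuning could look like, reusing df, engine, and table_name from the question; the sample size and chunk sizes below are arbitrary placeholders:

import time

# Time a few chunk sizes on a slice of the frame first; with 500 columns,
# smaller chunks keep each multi-row INSERT statement a manageable size.
sample = df.head(50_000)
for chunksize in (100, 500, 1_000, 5_000):
    start = time.time()
    sample.to_sql(table_name, engine, method='multi',
                  if_exists='replace', index=False, chunksize=chunksize)
    print(f'chunksize={chunksize}: {time.time() - start:.1f} s')

It is also worth timing a run without method='multi', since pandas' default executemany path can behave differently on very wide tables.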

With a dataset that large you may need to write it out to a tab-delimited file, transfer it to your EC2 instance, and then use the \copy command in psql to get it done in a reasonable amount of time.
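If you go that route, the export step might look roughly like this (the file and table names are placeholders, not something from the question):

# Dump the frame as tab-separated text, writing NULLs the way COPY's text
# format expects them by default (tab delimiter, backslash-N for NULL).
df.to_csv('model_data.tsv', sep='\t', index=False, header=False, na_rep='\\N')

# Then, after transferring the file to a host that can reach the database, in psql:
#   \copy my_table FROM 'model_data.tsv' WITH (FORMAT text)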

Since it is RDS instead of EC2, I checked, and it looks like RDS PostgreSQL supports COPY from S3: https://www.postgresql.org/docs/current/sql-copy.html

This will mean some work setting up the table yourself, but the time savings may make it worth it. If you create a small subset of your DataFrame and let DataFrame.to_sql() build the table for you, you can then upload your tab-separated file to S3 and COPY it into that table using the S3 import facility, as in the sketch below.
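A rough sketch of that workflow, assuming the RDS aws_s3 extension is installed (CREATE EXTENSION aws_s3;) and the instance has permission to read the bucket; the bucket, key, and region values are placeholders:

from sqlalchemy import text

# 1. Let pandas create an empty table with the right column names and types.
df.head(0).to_sql(table_name, engine, if_exists='replace', index=False)

# 2. Ask the RDS instance to pull the tab-separated file straight from S3.
import_sql = text("""
    SELECT aws_s3.table_import_from_s3(
        :table_name,
        '',                      -- empty column list = all columns, in file order
        '(FORMAT text)',         -- tab-delimited text, matching the dump above
        aws_commons.create_s3_uri(:bucket, :key, :region)
    )
""")

with engine.begin() as conn:
    conn.execute(import_sql, {
        'table_name': table_name,
        'bucket': 'my-bucket',       # placeholder
        'key': 'model_data.tsv',     # placeholder
        'region': 'us-east-1',       # placeholder
    })

Creating the table from df.head(0) first means the server-side import only has to move data, which is usually far faster than issuing millions of INSERTs from the client for a frame this size.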
