I need to run a test in which I simulate 20 years' worth of historical data in a PostgreSQL (and TimescaleDB) database. To do so, I generate .sql files and ingest them into the target database using the psql client.
We calculated that our table will hold 261 billion rows over 20 years, i.e. about 13.05 billion rows per year.
Each row has a timestamp (integer type), and to be more efficient I write the rows into the .sql files as transactions of 10,000 elements each. To keep each generated file at a manageable size on disk (I generate the files in Python), I limited each file to 20M rows.
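For illustration, the generator looks roughly like this sketch (the table and column names `metrics`, `ts`, `value` and the random values are placeholders for this question; the real schema is different):

```python
# Sketch of the file generator: each file holds ROWS_PER_FILE rows,
# grouped into transactions of TX_SIZE rows each.
# Table/column names and values are placeholders.
import random

TX_SIZE = 10_000
ROWS_PER_FILE = 20_000_000

def write_sql_file(path: str, start_ts: int) -> None:
    with open(path, "w") as f:
        for tx_start in range(0, ROWS_PER_FILE, TX_SIZE):
            f.write("BEGIN;\n")
            values = ",".join(
                f"({start_ts + tx_start + i}, {random.random()})"
                for i in range(TX_SIZE)
            )
            f.write(f"INSERT INTO metrics (ts, value) VALUES {values};\n")
            f.write("COMMIT;\n")

write_sql_file("chunk_000.sql", start_ts=0)
```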
My idea was to generate these files one after another from a bash script and, as soon as a file is generated, run a psql command to ingest it into the DB. The problem is that I don't know how to orchestrate this: ingestion takes much longer than file generation, so I'm afraid my bash script will wait for each ingest to finish before it starts generating the next .sql file, then wait for the next ingest, and so on.
To summarize, I'm trying to build a pseudo-batch ingest pipeline in which every .sql file that has been ingested successfully is deleted, so the generated files don't take up too much disk space.
How can I avoid waiting for the ingest process to finish before generating the next .sql file and then starting its ingest?
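What I have in mind is something like the following Python sketch, where one thread keeps generating files while another ingests them with psql and deletes them on success (it reuses the `write_sql_file` helper from the snippet above; the database name, psql options and file names are placeholders). Is this a reasonable way to structure it, or is there a better pattern?

```python
# Rough sketch of the overlap I'm after: a producer thread generates files
# while the consumer ingests them with psql and removes them on success.
# Database name, file names and limits are placeholders.
import os
import queue
import subprocess
import threading

NUM_FILES = 10          # placeholder; the real run would need far more files
PENDING_LIMIT = 2       # bound how many generated-but-not-ingested files sit on disk

pending: "queue.Queue[str]" = queue.Queue(maxsize=PENDING_LIMIT)

def producer() -> None:
    for i in range(NUM_FILES):
        path = f"chunk_{i:03d}.sql"
        write_sql_file(path, start_ts=i * ROWS_PER_FILE)
        pending.put(path)          # blocks if PENDING_LIMIT files are already waiting
    pending.put(None)              # sentinel: no more files

def consumer() -> None:
    while True:
        path = pending.get()
        if path is None:
            break
        result = subprocess.run(
            ["psql", "-d", "mydb", "-v", "ON_ERROR_STOP=1", "-f", path]
        )
        if result.returncode == 0:
            os.remove(path)        # only delete files that were ingested successfully

threading.Thread(target=producer).start()
consumer()
```

The bounded queue is my attempt to address the disk-space concern: generation runs ahead of ingestion, but never by more than a couple of files.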
This is for development purposes; the data I want to ingest is close to what we expect to produce in production. The aim, for now, is to run read queries and compare their performance between PostgreSQL and TimescaleDB.