I am trying to read 30 CSV files (around 12 GB uncompressed, 30 million rows) into a single Parquet file using DuckDB. I upgraded today to the latest version, 0.10.0.
Each of my CSVs is standardised, so I am able to define the column names and types upfront, and I have done that as part of my query:
COPY (SELECT * FROM read_csv('*.csv',
    delim = ',',
    header = false,
    parallel = true,
    auto_detect = false,
    columns = {
        'UPRN': 'BIGINT',
        'UDPRN': 'INT',
        'CHANGE_TYPE': 'VARCHAR(1)'
        -- 70 more attributes
    }))
TO 'output.parquet' (FORMAT 'parquet', COMPRESSION 'zstd', ROW_GROUP_SIZE 100000);
Each time I run this I hit an out-of-memory error.
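To narrow down whether it is the CSV read or the Parquet write that exhausts memory, I can run the scan on its own with no output file (same options as above, columns elided):

SELECT count(*)
FROM read_csv('*.csv',
    delim = ',',
    header = false,
    auto_detect = false,
    columns = {'UPRN': 'BIGINT', 'UDPRN': 'INT', 'CHANGE_TYPE': 'VARCHAR(1)'});  -- plus the other 70 columns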
So I have since tried loading the CSV data into a persistent local database first, then exporting from there.
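The load step is roughly this (I am calling the table addressbase purely for illustration; the read_csv options are the same as above):

CREATE TABLE addressbase AS
SELECT * FROM read_csv('*.csv',
    delim = ',',
    header = false,
    auto_detect = false,
    columns = {'UPRN': 'BIGINT', 'UDPRN': 'INT', 'CHANGE_TYPE': 'VARCHAR(1)'});  -- plus the other 70 columns

and then the export: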
COPY addressbase TO 'output.parquet' (FORMAT 'parquet', COMPRESSION 'zstd', ROW_GROUP_SIZE 100000);
I get the same out-of-memory error:
Error: Out of Memory Error: failed to allocate data of size 4.2 MiB (13.9 GiB/13.9 GiB used)
I read the recent release notes and saw there is a new Temporary Memory Manager in 0.10.0, so I set a memory limit using
PRAGMA memory_limit = '10GB';
This definitely seems to make the loading more stable, but it has no impact on the Parquet export.
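Based on the documentation, I am planning to try these settings before the export, though I have not yet confirmed they help (the temp path is just a placeholder):

SET memory_limit = '10GB';
SET preserve_insertion_order = false;  -- the docs suggest this reduces memory usage for large reads/writes
SET temp_directory = 'C:\duckdb_tmp';  -- give DuckDB somewhere to spill intermediates to disk
SET threads = 4;                       -- fewer threads means fewer in-flight buffers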
I am on a Windows laptop with 32 GB of RAM, using the latest DuckDB CLI.
I am sure there are Python-based approaches to doing this, but I would prefer to stick with DuckDB, as I am already using it in another process and it is a great tool!