Create a parquet file from CSV represented as string using duckdb

Question

Given the following:

import io
buffer = io.BytesIO()
csv_data = 'col1,col2\n1,2\n3,4'

I want to know how to use duckdb ( https://duckdb.org/docs/data/parquet/overview.html ) to write a Parquet file to the in-memory buffer, so that the file contains the column and row data from the csv_data variable.

I'm using duckdb version 0.7.1 (I'm not fixed to this version though).

Edit

An answer suggested trying the following:

import duckdb
from io import BytesIO
csv_data = BytesIO(b'col1,col2\n1,2\n3,4')
duckdb.read_csv(csv_data, header=True).write_parquet('csv_data.parquet')

This failed with:

In [1]: import duckdb

In [2]: from io import BytesIO
   ...:

In [3]: csv_data = BytesIO(b'col1,col2\n1,2\n3,4')
   ...:

In [4]: duckdb.read_csv(csv_data, header=True).write_parquet('csv_data.parquet')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 duckdb.read_csv(csv_data, header=True).write_parquet('csv_data.parquet')

TypeError: read_csv(): incompatible function arguments. The following argument types are supported:
    1. (name: str, connection: duckdb.DuckDBPyConnection = None, header: object = None, compression: object = None, sep: object = None, delimiter: object = None, dtype: object = None, na_values: object = None, skiprows: object = None, quotechar: object = None, escapechar: object = None, encoding: object = None, parallel: object = None, date_format: object = None, timestamp_format: object = None, sample_size: object = None, all_varchar: object = None, normalize_names: object = None, filename: object = None) -> duckdb.DuckDBPyRelation

Invoked with: <_io.BytesIO object at 0x7f21ed64d620>; kwargs: header=True

This works in 0.8.0

– jqurious, Commented May 19, 2023 at 19:10

@jqurious thanks - I can confirm that this works in 0.8.0

– baxx, Commented May 20, 2023 at 18:14

baxx · Accepted Answer · 2023-05-20 18:16:02Z

1

You can read the CSV with read_csv and write it to Parquet with write_parquet:

import duckdb
from io import BytesIO
csv_data = BytesIO(b'col1,col2\n1,2\n3,4')
duckdb.read_csv(csv_data, header=True).write_parquet('csv_data.parquet')

Note: this does not work on version 0.7.1, but it does work on 0.8.0.

edited May 20, 2023 at 18:16

baxx


answered May 11, 2023 at 15:51

Pedro Holanda



1 Comment

baxx Over a year ago

Thanks but that didn't work - I'll update the OP with the error I got from that
