1

I have Parquet files arranged in this layout:

/db/{year}/table{date}.parquet

In each year folder, there are up to 365 files.

If I want to query data from a time range, say the week 2024-04-28 to 2024-05-04, I could use

SELECT
  count(*) AS count
FROM read_parquet('/db/2024/table*.parquet')
WHERE date >= '2024-04-28' and date < '2024-05-05'

But I don't need to read every file matching /db/2024/table*.parquet. I know exactly which seven files have the data I need. How do I specify just those files in DuckDB? I am using Python, so I could do my own filtering and put the paths in a Python list such as filenames.

3 Answers

3

Assuming that your files are named like this:

table20240428.parquet

and so on, this should work:

SELECT count(*) AS count
FROM read_parquet(
  list_transform(
    generate_series('2024-05-04'::DATE - 6, '2024-05-04'::DATE, interval '1' day),
    n -> '/db/' || strftime(n, '%Y') || '/table' || strftime(n, '%Y%m%d') || '.parquet'
  )
)
;

If the {date} part of your file names uses a different format, adjust the format string in the second strftime call.
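Since the question mentions Python, here is a minimal sketch of rendering the query above for any end date and file-name pattern. The helper name week_query and its date_fmt parameter are made up for illustration, and the format string is interpolated directly into the SQL, so it is assumed to come from your own code rather than from user input.

```python
from datetime import date


def week_query(last_day: date, date_fmt: str = "%Y%m%d") -> str:
    """Render the list_transform query for the 7 days ending on last_day.

    date_fmt is the strftime pattern used in the file names (hypothetical
    parameter; it is pasted into the SQL as-is, so keep it trusted).
    """
    day = last_day.isoformat()
    return (
        "SELECT count(*) AS count\n"
        "FROM read_parquet(\n"
        "  list_transform(\n"
        f"    generate_series('{day}'::DATE - 6, '{day}'::DATE, interval '1' day),\n"
        f"    n -> '/db/' || strftime(n, '%Y') || '/table' || strftime(n, '{date_fmt}') || '.parquet'\n"
        "  )\n"
        ")"
    )


# Example: files named like table2024-04-28.parquet instead of table20240428.parquet
print(week_query(date(2024, 5, 4), "%Y-%m-%d"))
```

Because the year folder is derived from each generated day, a week that spans New Year resolves to paths in both year folders.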


1

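You can pass a Python list into a DuckDB query by serializing it to a list literal:

import json

import duckdb

# json.dumps renders the list as the text ["df1.parquet", "df3.parquet"]
files = json.dumps(["df1.parquet", "df3.parquet"])

duckdb.sql(f"""select * from read_parquet({files});""")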
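For the layout in the question, the list could be built from the date range itself. This is a rough sketch, assuming files are named like table20240428.parquet under /db/{year}/; the helper names files_between and count_rows are hypothetical. It hands the list to duckdb.read_parquet, which as far as I know accepts a list of paths, instead of formatting it into the SQL text.

```python
from datetime import date, timedelta


def files_between(start: date, end: date, root: str = "/db") -> list[str]:
    """Return one path per day in the half-open range [start, end)."""
    days = (end - start).days
    return [
        f"{root}/{d:%Y}/table{d:%Y%m%d}.parquet"
        for d in (start + timedelta(days=i) for i in range(days))
    ]


def count_rows(files: list[str]) -> int:
    """Count rows across exactly the given Parquet files."""
    import duckdb  # imported lazily so the path helper works without DuckDB

    return duckdb.read_parquet(files).aggregate("count(*) AS count").fetchone()[0]


# The week from the question: 2024-04-28 up to, but not including, 2024-05-05
filenames = files_between(date(2024, 4, 28), date(2024, 5, 5))
print(filenames)
```

Calling count_rows(filenames) then reads only those seven files. Since a year folder holds "up to" 365 files, some days may be missing; filtering the list with os.path.exists first should avoid an error for a file that is not there.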

0

Sounds like you need to look at Hive Partitioning?

Seems like the perfect solution for your problem, e.g. use the date field of your parquet files as the partition key: https://duckdb.org/docs/stable/data/partitioning/hive_partitioning.html#hive-partitioning

Filters on the partition keys are automatically pushed down into the files. This way the system skips reading files that are not necessary to answer a query.
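Note that Hive partitioning expects key=value directory names, so the /db/{year}/ layout from the question would have to be rearranged first. Below is a sketch of what that could look like; the /db_hive root, the day partition key, and the data.parquet file name are all assumptions chosen for illustration (day rather than date, to avoid clashing with the date column already inside the files).

```python
from datetime import datetime
from pathlib import PurePosixPath


def hive_path(old_path: str, root: str = "/db_hive") -> str:
    """Map /db/2024/table20240428.parquet to a Hive-style location.

    Assumes the original file names look like table{YYYYMMDD}.parquet.
    """
    name = PurePosixPath(old_path).stem  # e.g. "table20240428"
    day = datetime.strptime(name.removeprefix("table"), "%Y%m%d").date()
    return f"{root}/day={day.isoformat()}/data.parquet"


# With that layout, a filter on the partition key lets DuckDB skip
# every directory outside the requested week.
QUERY = """
SELECT count(*) AS count
FROM read_parquet('/db_hive/*/*.parquet', hive_partitioning = true)
WHERE day >= '2024-04-28' AND day < '2024-05-05'
"""

print(hive_path("/db/2024/table20240428.parquet"))
```

Passing QUERY to duckdb.sql would then run the weekly count without listing any files by hand.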

HTH
