I'm trying to read a large Parquet file using DuckDB within a Jupyter notebook running in VS Code. My goal is to query a subset of the data directly from the Parquet file without loading the entire dataset into memory, as my system has limited RAM. However, DuckDB keeps throwing an IOException, despite the file path being correct and readable with pandas.
Here’s the basic code I tried to query a subset of the Parquet file:
```python
import duckdb

# Querying a subset of the Parquet file
query = """
SELECT *
FROM '../Data/train.parquet'
WHERE date_id <= 85
"""
train_df = duckdb.query(query).to_df()
```
I receive the following error:
```
IOException: IO Error: No files found that match the pattern "../Data/train.parquet"
```
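For reference, this is roughly how I sanity-checked the path from inside the same kernel. (In VS Code the Jupyter kernel can start in the workspace root rather than the notebook's folder, so a relative path may resolve somewhere other than expected; the path below is just the one from my query.)

```python
import os
from pathlib import Path

# The kernel's current working directory -- relative paths inside the
# SQL string are resolved against this, not the notebook's location.
cwd = os.getcwd()
print(cwd)

# Resolve the relative path and check whether it exists from the
# kernel's point of view.
p = Path('../Data/train.parquet').resolve()
print(p, p.exists())
```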
I confirmed that the file path is correct: both the relative path (../Data/train.parquet) and the absolute path load fine with pandas.read_parquet(), yet DuckDB raises the same IOException either way. Here’s the code I used with the absolute path:
```python
import os

absolute_path = os.path.abspath('../Data/train.parquet')
query = f"""
SELECT *
FROM '{absolute_path}'
WHERE date_id <= 85
"""
train_df = duckdb.query(query).to_df()
```
Again, the file definitely exists at the specified path and loads fine with pandas. However, I want to avoid pandas here, because pandas.read_parquet() loads the entire dataset into memory, which my system cannot handle due to RAM limitations.
Why is DuckDB unable to read this Parquet file directly, even though it loads correctly with pandas? Is there a specific configuration or setup required for DuckDB to work with Parquet files in Jupyter Notebooks or VS Code? Are there alternative ways to query large Parquet files in DuckDB without loading everything into memory?
Any insights or suggestions would be greatly appreciated.