I'm trying to read a large Parquet file using DuckDB within a Jupyter notebook running in VS Code. My goal is to query a subset of the data directly from the Parquet file without loading the entire dataset into memory, as my system has limited RAM. However, DuckDB keeps throwing an IOException, despite the file path being correct and readable with pandas.

Here’s the basic code I tried to query a subset of the Parquet file:

import duckdb

# Querying a subset of the Parquet file
query = """
    SELECT * 
    FROM '../Data/train.parquet'
    WHERE date_id <= 85
"""
train_df = duckdb.query(query).to_df()

I receive the following error:

IOException: IO Error: No files found that match the pattern "../Data/train.parquet"

I confirmed that the file path is correct, and both relative (../Data/train.parquet) and absolute paths work when loading with pandas.read_parquet(). For DuckDB, I tried both relative and absolute paths but still received the same IOException. Here’s the code I used with the absolute path:

import os
absolute_path = os.path.abspath('../Data/train.parquet')
query = f"""
    SELECT * 
    FROM '{absolute_path}'
    WHERE date_id <= 85
"""
train_df = duckdb.query(query).to_df()

The Parquet file is confirmed to be located at the specified path and loads fine with pandas. However, I want to avoid using pandas to read the file, as that would load the entire dataset into memory, which my system cannot handle due to RAM limitations.
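
A check along the following lines shows which working directory the notebook kernel is using and what the relative path resolves to. This is only a sketch: `describe_path` is a hypothetical helper name (not part of DuckDB or pandas), and it uses the standard library only, so it says nothing about how DuckDB itself opens the file.

```python
import os


def describe_path(path):
    """Report how the current process resolves `path` (no DuckDB involved)."""
    resolved = os.path.abspath(path)  # resolved against os.getcwd()
    return {
        "cwd": os.getcwd(),
        "resolved": resolved,
        "exists": os.path.isfile(resolved),
    }


# '../Data/train.parquet' is the path from the question; adjust as needed.
info = describe_path("../Data/train.parquet")
for key, value in info.items():
    print(f"{key}: {value}")
```

If `exists` is False, the working directory is probably not the folder the relative path was written for; if it is True, the `resolved` value is the string to compare against the path in the DuckDB error message.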

Why is DuckDB unable to read this Parquet file directly, even though it loads correctly with pandas? Is there a specific configuration or setup required for DuckDB to work with Parquet files in Jupyter Notebooks or VS Code? Are there alternative ways to query large Parquet files in DuckDB without loading everything into memory?

Any insights or suggestions would be greatly appreciated.

  • Use '../train.parquet' or '../../Data/train.parquet', not '../Data/train.parquet'; in this use case the directory above you has no name per se. If in doubt, show the output of 'ls -al ..'; there should be no component named 'Data'. Commented Nov 8, 2024 at 14:31
  • @ticktalk I tried your suggested solution, but it gives me the same error. I realized that I did not describe my project setup: the root folder holds multiple folders, two of which are "Data" and "src", and the "Data" folder contains train.parquet. "ls -al" is a Linux command, but os.listdir("..") shows that the parent directory contains a Data folder. I appreciate your help and was hoping you could help me further. Commented Nov 8, 2024 at 16:13
  • Show your working: all inputs, all outputs, error messages, etc. Copy and paste, do not paraphrase; "it doesn't work" is pretty useless without supporting evidence of what "it" is. I've posted my attempts in the answer section; hopefully that will help. Commented Nov 9, 2024 at 15:54

1 Answer

Hopefully these examples will be of some help (run on Linux Mint).

$ tree /testing
testing
├── Data
│   ├── source
│   │   ├── abspath.py
│   │   ├── relative.py
│   │   ├── tick1.py
│   │   ├── tick2.py
│   │   └── testMe.py
│   └── train.parquet
└── title

2 directories, 7 files


$ cat testMe.py
import duckdb
import sys
import os

absolute_path = os.path.abspath(sys.argv[1])  # the file name is passed on the command line

query = f""" SELECT * FROM '{absolute_path}' """

print(f"\nSCRIPT:{sys.argv[0]} filename:{sys.argv[1]}")
print(f'absolute_path:[{absolute_path}]')
print(f'QUERY:[{query}]')

try:
    ls = os.listdir(os.path.dirname(sys.argv[1]))  # list the directory part of the path as given
except Exception as e:
    print(f"ERROR: {e}\n\n" )
    print(f'Cannot execute query: {query}\n' )
else:
    print(f'os.listdir({ls}\n')

    train_df = duckdb.query(query).to_df()
    print(f'ROWS:{train_df.shape[0]}\n')


$######################
$# first run - full path to file
$#
$ python testMe.py /testing/Data/train.parquet

SCRIPT:testMe.py filename:/testing/Data/train.parquet
absolute_path:[/testing/Data/train.parquet]
QUERY:[ SELECT * FROM '/testing/Data/train.parquet' ]
os.listdir(['train.parquet', 'source']

ROWS:3376567

$######################
$# second run - relative path referencing the Data directory
$#
$ python testMe.py ../../Data/train.parquet

SCRIPT:testMe.py filename:../../Data/train.parquet
absolute_path:[/testing/Data/train.parquet]
QUERY:[ SELECT * FROM '/testing/Data/train.parquet' ]
os.listdir(['train.parquet', 'source']

ROWS:3376567

$######################
$# third run - relative path referencing the parent Data directory simply as '..'
$#
$ python testMe.py ../train.parquet

SCRIPT:testMe.py filename:../train.parquet
absolute_path:[/testing/Data/train.parquet]
QUERY:[ SELECT * FROM '/testing/Data/train.parquet' ]
os.listdir(['train.parquet', 'source']

ROWS:3376567


$######################
$# fourth run - relative path referencing the parent Data directory as '../Data' (erroneously)
$#
$ python testMe.py ../Data/train.parquet

SCRIPT:testMe.py filename:../Data/train.parquet
absolute_path:[/testing/Data/Data/train.parquet]
QUERY:[ SELECT * FROM '/testing/Data/Data/train.parquet' ]
ERROR: [Errno 2] No such file or directory: '../Data'

Cannot execute query:  SELECT * FROM '/testing/Data/Data/train.parquet'
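
The four runs above come down to how a relative path is joined to the current working directory. The sketch below reproduces that resolution with the standard library only, without touching the file system or DuckDB. The working directory `/testing/Data/source` is inferred from the outputs above, and `resolve` is a hypothetical helper name used just for illustration.

```python
import posixpath


def resolve(cwd, path):
    """Mimic how a POSIX path is resolved against a working directory."""
    # An absolute `path` replaces `cwd`; a relative one is appended, then '..' is collapsed.
    return posixpath.normpath(posixpath.join(cwd, path))


cwd = "/testing/Data/source"  # inferred from the runs above

for candidate in (
    "/testing/Data/train.parquet",
    "../../Data/train.parquet",
    "../train.parquet",
    "../Data/train.parquet",
):
    print(f"{candidate:30} -> {resolve(cwd, candidate)}")
```

Only the last candidate ends up at `/testing/Data/Data/train.parquet`, which matches the failing fourth run. The same reasoning applies in a notebook, where the working directory is whatever `os.getcwd()` reports for the kernel.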
