
I'm trying to figure out what the best practice is when it comes to reading data from within a package.

I understand that in Python 3.12 I should use importlib.resources; see e.g. "How to read a (static) file from inside a Python package?"

Initially, I organized the package as follows:

foo
├── setup.cfg
├── data
│   ├── __init__.py
│   ├── data.csv
│   └── data2
│       ├── __init__.py
│       └── data2.csv
└── src
    └── foo
        ├── __init__.py
        └── bar.py
    

The idea was that the data could be read in bar.py as follows:

import importlib.resources

def get_data_file(file_name):
    return importlib.resources.files("foo.data").joinpath(file_name).read_text()
    

I added the following to setup.cfg:

include_package_data = True

[options.package_data]
mypkg = data/*.csv
mypkg.data2 = data2/*.csv

I intended to use foo as follows:

import foo.bar

foo.bar.get_data_file('data.csv')
foo.bar.get_data_file('data2/data2.csv')

However, I got the error message: No module named 'foo.data'.

I suspect that instead, my package should be organized as:

foo
├── setup.cfg
└── src
    ├── data
    │   ├── __init__.py
    │   ├── data.csv
    │   └── data2
    │       ├── __init__.py
    │       └── data2.csv
    └── foo
        ├── __init__.py
        └── bar.py

and bar.py should be changed to:

import importlib.resources

def get_data_file(file_name):
    return importlib.resources.files("data").joinpath(file_name).read_text()

While I can now read the text, I wonder whether this is the best practice, in terms of organizing the package, setting up setup.cfg, and the syntax related to importlib.

In particular, I personally thought it would have been more logical to put the data folder in the root folder instead of the src folder. Moreover, the syntax in setup.cfg is a bit confusing to me.
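To make the observation about the second layout concrete, here is a minimal sketch that rebuilds it in a temporary directory. It assumes that putting src on sys.path is a fair stand-in for installing the package, and the CSV content is a made-up placeholder. Under those assumptions, "data" resolves as a top-level package while "foo.data" does not, which matches the error message above.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

# Throwaway copy of the second layout; the CSV content is a made-up placeholder.
src = Path(tempfile.mkdtemp()) / "src"
(src / "foo").mkdir(parents=True)
(src / "data" / "data2").mkdir(parents=True)
for package_dir in ("foo", "data", "data/data2"):
    (src / package_dir / "__init__.py").write_text("")
(src / "data" / "data.csv").write_text("x,y\n1,2\n")

# Assumption: adding src/ to sys.path stands in for installing the project.
# Everything directly under src/ then becomes a top-level importable name.
sys.path.insert(0, str(src))
importlib.invalidate_caches()

# "data" resolves, because src/data is a top-level package in this layout.
text = importlib.resources.files("data").joinpath("data.csv").read_text()
print(text)

# "foo.data" does not resolve: there is no data package inside src/foo.
try:
    importlib.resources.files("foo.data")
except ModuleNotFoundError as exc:
    error = str(exc)
    print(error)
```

If this reading is right, the data directory is found only because it is a separate top-level package named data, not because it belongs to foo.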

EDIT

Following up on 9769953's comments:

  1. Could you elaborate on using relative import vs. importlib?

  2. Indeed the data inherently belongs to foo. Does that mean that the package should be organized as follows?

    foo
    ├── setup.cfg
    └── src
        └── foo
            ├── data
            │   ├── __init__.py
            │   ├── data.csv
            │   └── data2
            │       ├── __init__.py
            │       └── data2.csv
            ├── __init__.py
            └── bar.py

  3. I'm not completely sure I follow the question regarding helper functions.

I'll try to clarify the use case. I foresee that users would want to be able to install this package, and use it such that they can carry out calculations given data provided in data.csv, and data2.csv. In particular, these data will be parsed to pandas DataFrames.

If I understand your question correctly, you wonder whether this is the correct way to provide users with both the package, and the required data, correct?
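As a rough illustration of that use case, the sketch below parses the packaged CSV files into rows. It assumes the data directory sits inside the foo package, uses the standard-library csv module as a stand-in for pandas so that it runs without extra dependencies, and builds a throwaway package with placeholder values. The helper name load_rows is hypothetical.

```python
import csv
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

# Throwaway package tree with the data inside foo; the values are placeholders.
src = Path(tempfile.mkdtemp()) / "src"
(src / "foo" / "data" / "data2").mkdir(parents=True)
(src / "foo" / "__init__.py").write_text("")
(src / "foo" / "data" / "data.csv").write_text("x,y\n1,2\n3,4\n")
(src / "foo" / "data" / "data2" / "data2.csv").write_text("a,b\n5,6\n")

sys.path.insert(0, str(src))   # assumption: stand-in for an installed package
sys.modules.pop("foo", None)   # make sure this demo tree is the one imported
importlib.invalidate_caches()


def load_rows(file_name):
    """Parse a CSV file below foo/data into a list of dicts, one per row."""
    resource = importlib.resources.files("foo") / "data" / file_name
    with resource.open("r", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


rows = load_rows("data.csv")
rows2 = load_rows("data2/data2.csv")
print(rows)
print(rows2)
```

With pandas installed, the open handle could presumably be passed to pandas.read_csv in place of csv.DictReader to get a DataFrame.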

  4. I believe you're referring to example5.py, right? Seeing that this example seemed to be inconsistent with the advice given in the linked question, I got a bit confused regarding the best practice.

Furthermore, suppose we wanted to get the first file tree to work. Could this be achieved by adapting setup.cfg? In particular, I was wondering whether I could or should add data to:

[options.packages.find]
where = src

  5. I based the inclusion of __init__.py in the data folders on https://importlib-resources.readthedocs.io/en/latest/using.html and https://www.youtube.com/watch?v=ZsGFU2qh73E. That said, I now see in wim's answer that this is deprecated. I have removed those __init__.py files and can successfully read the data files, confirming that they are no longer needed.

  6. Regarding pkgutil, it seems that importlib is favored over pkgutil as of Python 3.9, right?
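For comparison, here is a minimal sketch that reads the same file once with pkgutil.get_data and once with importlib.resources.files (the latter was added in Python 3.9). It assumes the data directory lives inside foo without any __init__.py of its own, and it builds a throwaway package with placeholder content. It only shows that both calls return the same data in this setup and is not meant to settle which one is preferred.

```python
import importlib
import importlib.resources
import pkgutil
import sys
import tempfile
from pathlib import Path

# Throwaway package: foo is a regular package, foo/data is a plain directory.
src = Path(tempfile.mkdtemp()) / "src"
(src / "foo" / "data").mkdir(parents=True)
(src / "foo" / "__init__.py").write_text("")
# Written as bytes so both readers see identical content on every platform.
(src / "foo" / "data" / "data.csv").write_bytes(b"x,y\n1,2\n")

sys.path.insert(0, str(src))   # assumption: stand-in for an installed package
sys.modules.pop("foo", None)   # make sure this demo tree is the one imported
importlib.invalidate_caches()

# pkgutil.get_data takes a package name plus a '/'-separated relative path
# and returns bytes.
raw = pkgutil.get_data("foo", "data/data.csv")
via_pkgutil = raw.decode("utf-8")

# importlib.resources.files returns a path-like object that can be traversed.
resource = importlib.resources.files("foo") / "data" / "data.csv"
via_importlib = resource.read_text(encoding="utf-8")

print(via_pkgutil == via_importlib)
```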

  • Your own suggestion for the reorganisation of your packages seems to be the correct one (assuming both foo and data belong to the same package, which also seems to be called foo). Commented Apr 26, 2024 at 13:33
  • Then, when you have an __init__.py file in base src, you can even do away with importlib, and use a relative import to import things. Commented Apr 26, 2024 at 13:34
  • Question is: does data inherently belong to foo? Then it should be moved inside foo, not alongside it. In my opinion (from a cursory glance at your question), that is the correct solution. Otherwise, make it two separate packages (with separate setup.cfg, pyproject.toml, setup.py files or whatever your preferred style is). Commented Apr 26, 2024 at 13:35
  • Do you intend to have helper functions in the data (sub)package for reading the file? Or are the Python files there just to structure the data? Because the latter seems a bit awkward, structuring data with Python files if it's not a Python package. Commented Apr 26, 2024 at 13:40
  • Thanks. I'm going to retract my statements a bit; I got confused by your data directories having __init__.py files. I think those shouldn't be there (after all, it's data, not a Python subpackage). Commented Apr 26, 2024 at 14:06
