I'm trying to figure out what the best practice is when it comes to reading data from within a package.
I understand that in Python 3.12 I should use importlib.resources; see, e.g., "How to read a (static) file from inside a Python package?"
Initially, I organized the package as follows:
foo
├── setup.cfg
├── data
│   ├── __init__.py
│   ├── data.csv
│   └── data2
│       ├── __init__.py
│       └── data2.csv
└── src
    └── foo
        ├── __init__.py
        └── bar.py
The idea was that the data could be read in bar.py as follows:
import importlib.resources

def get_data_file(file_name):
    return importlib.resources.files("foo.data").joinpath(file_name).read_text()
I added the following to setup.cfg:
include_package_data = True
[options.package_data]
mypkg = data/*.csv
mypkg.data2 = data2/*.csv
I intended to use foo as follows:
import foo.bar
foo.bar.get_data_file('data.csv')
foo.bar.get_data_file('data2/data2.csv')
However, I got the error message: No module named 'foo.data'.
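To narrow this down, here is a minimal sketch that rebuilds the first layout in a temporary directory and asks importlib.util.find_spec whether each module can be located. The helper names build_first_layout and is_importable, the temporary paths, and the CSV contents are hypothetical and only serve the illustration. Putting src on sys.path is my stand-in for an installed package. Under that assumption, foo is found but foo.data is not, because data sits next to src rather than inside src/foo.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path


def build_first_layout(root: Path) -> Path:
    """Recreate the first file tree under root and return the src directory."""
    (root / "data" / "data2").mkdir(parents=True)
    (root / "data" / "__init__.py").write_text("")
    (root / "data" / "data.csv").write_text("a,b\n1,2\n")
    (root / "data" / "data2" / "__init__.py").write_text("")
    (root / "data" / "data2" / "data2.csv").write_text("c,d\n3,4\n")
    (root / "src" / "foo").mkdir(parents=True)
    (root / "src" / "foo" / "__init__.py").write_text("")
    return root / "src"


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate module_name on the current sys.path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of module_name cannot be found.
        return False


with tempfile.TemporaryDirectory() as tmp:
    src_dir = build_first_layout(Path(tmp))
    sys.path.insert(0, str(src_dir))  # assumption: stands in for an install from src
    importlib.invalidate_caches()
    try:
        foo_found = is_importable("foo")
        foo_data_found = is_importable("foo.data")
    finally:
        # Undo the changes made for the demonstration.
        sys.path.remove(str(src_dir))
        sys.modules.pop("foo", None)

print(foo_found, foo_data_found)
```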
I suspect that instead, my package should be organized as:
foo
├── setup.cfg
└── src
    ├── data
    │   ├── __init__.py
    │   ├── data.csv
    │   └── data2
    │       ├── __init__.py
    │       └── data2.csv
    └── foo
        ├── __init__.py
        └── bar.py
and bar.py should be changed to:
import importlib.resources

def get_data_file(file_name):
    return importlib.resources.files("data").joinpath(file_name).read_text()
While I can now read the text, I wonder whether this is the best practice, in terms of organizing the package, setting up setup.cfg, and the syntax related to importlib.
In particular, I personally thought it would've been more logical to put the data folder in the root folder instead of the src folder. Moreover, the syntax in setup.cfg is a bit confusing to me.
EDIT
Following up on 9769953's comments:
Could you elaborate on using relative import vs. importlib?
Indeed the data inherently belongs to foo. Does that mean that the package should be organized as follows?
foo
├── setup.cfg
└── src
    └── foo
        ├── data
        │   ├── __init__.py
        │   ├── data.csv
        │   └── data2
        │       ├── __init__.py
        │       └── data2.csv
        ├── __init__.py
        └── bar.py
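If so, I assume setup.cfg would need to look roughly like the sketch below. As far as I understand, the globs under [options.package_data] are resolved relative to the package directory (src/foo here), and the key is the package name (foo, not mypkg). The exact patterns are my assumption for this layout.

```ini
[options]
package_dir =
    = src
packages = find:

[options.packages.find]
where = src

[options.package_data]
foo = data/*.csv, data/data2/*.csv
```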
- I'm not completely sure I follow the question regarding helper functions.
I'll try to clarify the use case. I foresee that users would want to install this package and use it to carry out calculations on the data provided in data.csv and data2.csv. In particular, these data will be parsed into pandas DataFrames.
If I understand your question correctly, you wonder whether this is the correct way to provide users with both the package, and the required data, correct?
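To make the use case concrete, here is a minimal sketch of what I have in mind, assuming the data directory lives inside the foo package. The helper name read_rows, the temporary stand-in package, and the CSV contents are hypothetical. To keep the sketch free of third-party dependencies it parses with csv.DictReader; with pandas installed, the same open file handle could be passed to pandas.read_csv instead.

```python
import csv
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path


def read_rows(package: str, relative_path: str) -> list[dict[str, str]]:
    """Parse a CSV file shipped inside `package` into a list of row dicts."""
    resource = importlib.resources.files(package) / relative_path
    with resource.open("r", encoding="utf-8", newline="") as handle:
        # With pandas installed, pandas.read_csv(handle) could be used here instead.
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for an installed foo package with bundled data.
    pkg = Path(tmp) / "foo"
    (pkg / "data" / "data2").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "data" / "data.csv").write_text("x,y\n1,2\n3,4\n")
    (pkg / "data" / "data2" / "data2.csv").write_text("name,value\nalpha,10\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        rows = read_rows("foo", "data/data.csv")
        rows2 = read_rows("foo", "data/data2/data2.csv")
    finally:
        # Undo the changes made for the demonstration.
        sys.path.remove(tmp)
        sys.modules.pop("foo", None)

print(rows)
print(rows2)
```

Anchoring on the foo package and passing the path to the data file relative to it means the data folders themselves do not have to be importable.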
- I believe you're referring to example5.py, right? Seeing that this example seemed to be inconsistent with the advice given in the linked question, I got a bit confused regarding the best practice.
Furthermore, let's assume we would like to try to get the first filetree to work. I was wondering whether this could be achieved by adapting setup.cfg, in particular, I was wondering whether I could/should add data to:
[options.packages.find]
where = src
- I based the inclusion of __init__.py in the data folders on https://importlib-resources.readthedocs.io/en/latest/using.html and https://www.youtube.com/watch?v=ZsGFU2qh73E. That said, I now see in wim's answer that this is deprecated. I have removed those __init__.py files and can successfully read the data files, confirming that they are not needed anymore.
- Regarding pkgutil, it seems that importlib is favored over pkgutil as of Python 3.9, right?
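To compare the two for myself, here is a small sketch that reads the same file once with pkgutil.get_data and once with importlib.resources.files. The temporary stand-in package and its CSV contents are hypothetical; note that the data folder deliberately has no __init__.py.

```python
import importlib
import importlib.resources
import pkgutil
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in package; data/ intentionally has no __init__.py.
    pkg = Path(tmp) / "foo"
    (pkg / "data").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "data" / "data.csv").write_bytes(b"x,y\n1,2\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # pkgutil takes the package name and a path relative to the package.
        via_pkgutil = pkgutil.get_data("foo", "data/data.csv")
        # importlib.resources returns a traversable object rooted at the package.
        via_importlib = (
            importlib.resources.files("foo") / "data" / "data.csv"
        ).read_bytes()
    finally:
        # Undo the changes made for the demonstration.
        sys.path.remove(tmp)
        sys.modules.pop("foo", None)

print(via_pkgutil == via_importlib)
```

Both calls return the raw bytes of the file in this setup, so the difference seems to lie in the interface rather than in what can be read.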
- [...] foo and data belong to the same package, which seems to be also called foo).
- [...] __init__.py file in base src, you can even do away with importlib, and use a relative import to import things.
- [...] data inherently belong to foo? Then it should be moved inside foo, not alongside it. In my opinion (from a cursory glance at your question), that is the correct solution. Otherwise, make it two separate packages (with separate setup.cfg, pyproject.toml, setup.py files or whatever your preferred style is).
- [...] data (sub)package for reading the file? Or are the Python files there just to structure the data? Because the latter seems a bit awkward, structuring data with Python files if it's not a Python package.
- [...] __init__.py files. I think those shouldn't be there (after all, it's data, not a Python subpackage).