I get the sense that you wished to avoid depending upon
lazynlp.
But I wish that you had.
More on that below.
Then we would have an up-to-date tldextract dep, plus justext,
which I think is the big thing you wanted to evict from the deps
and which seems a nice enough library to me.
BTW, though it's not published on PyPI,
you can still use a GitHub repo URL to depend on lazynlp.
You can even bake in a particular immutable commit hash.
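
To make that concrete, here is a rough sketch of what the pyproject.toml entry might look like.
I'm assuming the repo lives at github.com/chiphuyen/lazynlp and installs under the name lazynlp.
The project name and the commit placeholder are hypothetical, so substitute your own.

```toml
[project]
name = "your-project"        # hypothetical name, use your own
version = "0.1.0"
dependencies = [
    "httpx",
    # PEP 508 direct reference, pinned to a single commit.
    # Replace <full-commit-sha> with the 40-character hash you want to freeze.
    "lazynlp @ git+https://github.com/chiphuyen/lazynlp@<full-commit-sha>",
]
```

Some build backends reject direct references unless you opt in, so check yours if the build complains.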
After you (quickly) ship version 0.1.0,
I urge you to consider using uv to manage dependencies
listed in pyproject.toml, such as httpx.
Consider adding a Makefile with an install target, or a shell script,
that shows how to pull in deps and assemble a small text corpus.