
When installing libraries directly in Databricks notebook cells via %pip install, the Python interpreter needs to be restarted. My understanding is that, in order for newly installed packages to become visible and importable in the rest of the notebook cells, the interpreter must be restarted.

How can this interpreter restart be performed programmatically?

I am installing packages via a function call, based on requirements stored in a separate file, but I noticed that the newly installed packages are not present afterwards, even though the installation itself seems to succeed, at both notebook and cluster scope. I suspect the reason is that my installation code never restarts the interpreter.
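For reference, the helper I'm using looks roughly like the sketch below (the names `parse_requirements` and `install_requirements` and the file layout are illustrative, not my exact code):

```python
# Hypothetical sketch of the installation helper: read "name==version"
# pins from a requirements file and pip-install them into the notebook's
# Python environment. Without a subsequent interpreter restart, the new
# packages may still be invisible to later cells.
import subprocess
import sys

def parse_requirements(path):
    """Return the non-empty, non-comment lines of a requirements file."""
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.strip().startswith("#")]

def install_requirements(path):
    for pkg in parse_requirements(path):
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
```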

Is such a behavior possible?

2 Answers


You can use this:

dbutils.library.restartPython()

For more info, see https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-utils#dbutils-library-restartpython
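To use this from a function that also runs outside Databricks, you can guard the call; this is a sketch under the assumption that `dbutils` is injected by the Databricks runtime and simply doesn't exist elsewhere (the helper name is mine):

```python
# Hypothetical sketch: restart the interpreter after an install so new
# packages become importable. `dbutils` is a Databricks-runtime builtin
# and is undefined outside a notebook, hence the NameError guard.
def restart_python_if_possible():
    try:
        dbutils.library.restartPython()  # noqa: F821 - Databricks builtin
        return True
    except NameError:
        # Not running inside a Databricks notebook
        return False
```

Note that after the restart, any variables defined earlier in the notebook are gone, so this should run before your main logic.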




Firstly, you can preinstall packages at the cluster level or job level before you even start the notebook. I would recommend trying that before anything else. Trust me, I've run thousands of libraries, including custom-built ones, and have never needed to do this.

On to your actual question: yes, but it will throw an exception.


This code will kill the Python process on your Databricks workspace:

%sh

# Find the PID of the notebook's Python REPL process and kill it
pid=$(ps aux | grep PythonShell | grep -v grep | awk '{print $2}')
kill -9 $pid

However, this poses a problem for you: if you run it as a bash notebook cell, it cannot be wrapped in try/catch logic, which makes automated workflows impossible.

I would have suggested calling the shell commands from Python, but that wouldn't work either, as the exception would be thrown in that cell. You could perhaps use Scala and the scala.sys.process library to achieve it, but I'm no Scala expert, sadly.
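For illustration, the PID lookup half of the shell snippet can be done from Python like this (a sketch; `find_pids` is a name I made up, and actually killing the found PID would still raise in the calling cell, as noted above):

```python
# Hypothetical sketch: the Python equivalent of
#   ps aux | grep PythonShell | grep -v grep | awk '{print $2}'
# This only demonstrates process discovery, not the kill itself.
import subprocess

def find_pids(pattern):
    """Return PIDs of processes whose command line contains `pattern`."""
    out = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
    pids = []
    for line in out.splitlines()[1:]:  # skip the ps header row
        fields = line.split(None, 10)  # 11th field is the full command
        if len(fields) == 11 and pattern in fields[10]:
            pids.append(int(fields[1]))
    return pids

# Killing would then be (and this is what raises in the running cell):
# import os
# for pid in find_pids("PythonShell"):
#     os.kill(pid, 9)
```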

4 Comments

That's very insightful @Scott Bell, thank you for the detailed answer. I'm aware I can manually install packages on the cluster, or use pip install directly in a notebook, but neither of these seems to let me programmatically install packages by retrieving package names and version numbers from an external file. That's why I had the idea to write a custom function (using the Libraries API and POST requests), but because the interpreter is not restarted, the newly installed packages were not (always) accessible to the subsequent cells in the notebook.
So I can help with that. You should look up cluster init scripts, and this answer covers that. I have achieved what you want before, not by using the above, but by keeping a file in source control and having my CI/CD process call the Databricks API while loading this file. That is programmatic and source-controlled.
@ScottBell, I use Databricks in an enterprise environment, so I can't modify package versions at the cluster level. In the article you linked, there was a hyperlink for "Notebook-scoped Python libraries", which installs pip packages with the magic command %pip, but that still warns me to restart the Python runtime. How do I install packages once (in something like a venv scoped to the current notebook, or to me alone) and use them throughout the notebook?
