Check out this tool called pyspark_xray, which enables you to step into 100% of your PySpark code using PyCharm. Below is a high-level summary extracted from its documentation.
pyspark_xray is a diagnostic tool, in the form of a Python library, for PySpark developers to debug and troubleshoot PySpark applications locally. Specifically, it enables local debugging of PySpark RDD or DataFrame transformation functions that run on slave nodes.
The purpose of developing pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally and do production runs remotely using the same code base of a PySpark application. For the local debugging part, pyspark_xray specifically provides the capability of locally debugging Spark application code that runs on slave nodes; the absence of this capability is currently an unfilled gap for Spark application developers.
Problem
For developers, it's very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.
If you develop PySpark applications, you know that PySpark application code is made up of two categories:
- code that runs on master node
- code that runs on worker/slave nodes
While code on the master node can be accessed by a debugger locally, code on slave nodes is like a black box and is not accessible locally by a debugger.
Plenty of tutorials on the web cover the steps of debugging PySpark code that runs on the master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found; most people either treat this part of the code as a black box or say there is no need to debug it.
Spark code that runs on slave nodes includes, but is not limited to, lambda functions that are passed as input parameters to RDD transformation functions.
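To make this concrete, here is a minimal, purely illustrative example (not taken from the pyspark_xray documentation) of a lambda passed to an RDD transformation; this is exactly the kind of code a local debugger cannot step into:

```python
from pyspark.sql import SparkSession

# "local[*]" is only so this snippet runs standalone; on a real cluster
# the master would be YARN, Kubernetes, etc.
spark = SparkSession.builder.master("local[*]").appName("debug-example").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

# The lambda below is serialized and executed on worker (slave) nodes,
# in separate Python worker processes. A breakpoint set inside it in
# PyCharm or VSCode is never hit, because the debugger is attached only
# to the driver (master) process.
doubled = rdd.map(lambda x: x * 2)
print(doubled.collect())
```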
Solution
The pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on the master node but also code that runs on slave nodes, using PyCharm and other popular IDEs such as VSCode.
This library achieves these capabilities by using the following techniques:
- wrapper functions for Spark code that runs on slave nodes; see that section of the documentation for more details
- the practice of sampling input data under local debugging mode so that the application fits into the memory of your standalone local PC/Mac
  - For example, say your production input data has 1 million rows, which obviously cannot fit into one standalone PC/Mac's memory; to use pyspark_xray, you may take 100 sample rows as the input to debug your application locally
- usage of a flag to auto-detect local mode: CONST_BOOL_LOCAL_MODE from pyspark_xray's const.py auto-detects whether local mode is on or off based on the current OS, with values:
  - True: if the current OS is Mac or Windows
  - False: otherwise
With this flag in your Spark code base, you can locally debug and remotely execute your Spark application using the same code base, along the lines of the sketch below.
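To show how the flag, sampling, and wrapper-function techniques fit together, here is a rough, hypothetical sketch. The names run_transform and LOCAL_MODE are stand-ins invented for this example, not pyspark_xray's actual API, and LOCAL_MODE only mirrors the OS-based behavior described above for CONST_BOOL_LOCAL_MODE:

```python
import platform

from pyspark.sql import SparkSession

# Stand-in for the described CONST_BOOL_LOCAL_MODE behavior:
# True if the current OS is Mac (Darwin) or Windows, False otherwise.
LOCAL_MODE = platform.system() in ("Darwin", "Windows")


def transform(x):
    # Business logic you want to step through line by line in your IDE.
    return x * 2


def run_transform(rdd, func, sample_size=100):
    """Hypothetical wrapper illustrating the technique, not pyspark_xray's API."""
    if LOCAL_MODE:
        # Local debug path: take a small sample, run func as plain Python on
        # the driver so breakpoints inside func are hit, then re-parallelize
        # so the caller still gets an RDD back.
        sample = [func(x) for x in rdd.take(sample_size)]
        return rdd.context.parallelize(sample)
    # Production path: the very same func is shipped to the slave nodes.
    return rdd.map(func)


if __name__ == "__main__":
    # master("local[*]") is only for trying the sketch on one machine.
    spark = SparkSession.builder.master("local[*]").appName("xray-sketch").getOrCreate()
    numbers = spark.sparkContext.parallelize(range(1000))
    print(run_transform(numbers, transform).take(10))
```

The point of the pattern is that transform stays identical in both paths; only the wrapper decides whether it runs on the driver (debuggable) or on the slave nodes (production).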