13

I want to step through PySpark code while still using YARN. The way I currently do it is to start the pyspark shell, then copy-paste the code and execute it line by line. I wonder whether there is a better way.

pdb.set_trace() would be a much more efficient option if it worked. I tried it with spark-submit --master yarn --deploy-mode client. The program did stop and give me a shell at the line where pdb.set_trace() was called, but any pdb commands entered in that shell simply hung. The pdb.set_trace() was inserted between Spark function calls which, as I understand it, are executed in the driver, which runs locally with a terminal attached. I read the post How can pyspark be called in debug mode?, which seems to suggest that using pdb is impossible without relying on an IDE (PyCharm). However, if running Spark code interactively is possible, there should be a way to tell python-spark "run all the way to this line and give me a shell for a REPL (interactive use)". I haven't found any way to do this. Any suggestions/references are appreciated.
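For reference, the layout is roughly like the sketch below (the script name, app name and Spark calls are simplified placeholders, not my actual job):

# Hypothetical minimal driver script (demo.py), submitted with:
#   spark-submit --master yarn --deploy-mode client demo.py
import pdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pdb-demo").getOrCreate()

df = spark.range(100)  # driver-side Spark call
pdb.set_trace()        # a shell appears here, but pdb commands then hang
df.selectExpr("id * 2 AS doubled").show()  # next driver-side Spark call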

2 Comments
  • You can use a Jupyter notebook with PySpark. Commented Mar 13, 2018 at 3:07
  • @pault: We still need to copy-paste and execute the code line by line even with a Jupyter notebook. I want to step through the code just like pdb allows us to do in plain Python. Commented Mar 13, 2018 at 17:22

3 Answers

7

I also experienced pdb hanging. I found pdb_clone, and it works like a charm.

First, install pdb_clone

> pip install pdb_clone

Then, include these lines where you want to debug.

from pdb_clone import pdb
pdb.set_trace_remote()

When your program reaches that line, run the pdb-attach command in another terminal.

> pdb-attach
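Put together, a driver script using this snippet might look like the sketch below. Only the two pdb_clone lines come from the steps above; the surrounding Spark code and app name are illustrative placeholders.

from pyspark.sql import SparkSession
from pdb_clone import pdb

spark = SparkSession.builder.appName("pdb-clone-demo").getOrCreate()
df = spark.range(1000)

# Execution pauses here and waits for a debugger to attach;
# run `pdb-attach` in another terminal to get the prompt.
pdb.set_trace_remote()

df.groupBy((df.id % 10).alias("bucket")).count().show()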

1 Comment

pdb_clone was last updated 6 years ago, and pip install pdb_clone does not add a pdb-attach command.
2

Check out the tool called pyspark_xray, which enables you to step into 100% of your PySpark code using PyCharm. Below is a high-level summary extracted from its documentation.

pyspark_xray is a diagnostic tool, in the form of a Python library, for PySpark developers to debug and troubleshoot PySpark applications locally. Specifically, it enables local debugging of PySpark RDD or DataFrame transformation functions that run on slave nodes.

The purpose of pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally, and to do production runs remotely, using the same code base. For local debugging, pyspark_xray specifically provides the capability of debugging Spark application code that runs on slave nodes; the absence of this capability is currently an unfilled gap for Spark application developers.

Problem

For developers, it's very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.

If you develop PySpark applications, you know that PySpark application code is made up of two categories:

  • code that runs on master node
  • code that runs on worker/slave nodes

While code on the master node can be reached by a local debugger, code on slave nodes is a black box that a local debugger cannot access.

Plenty of tutorials on the web cover the steps of debugging PySpark code that runs on the master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found; most people either treat this part of the code as a black box or see no need to debug it.

Spark code that runs on slave nodes includes, but is not limited to, lambda functions that are passed as input parameters to RDD transformation functions, as in the example below.
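As a concrete, illustrative example (spark here is an existing SparkSession), the lambda below is serialized and executed on the worker nodes, so a local debugger attached to the driver never steps into it:

rdd = spark.sparkContext.parallelize(range(10))

# This lambda runs on the executors, not on the driver.
squares = rdd.map(lambda x: x * x).collect()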

Solution

The pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code: not only code that runs on the master node, but also code that runs on slave nodes, using PyCharm and other popular IDEs such as VSCode.

This library achieves these capabilities by using the following techniques:

  • wrapper functions around Spark code on slave nodes; check out that section of the documentation to learn more
  • the practice of sampling input data in local debugging mode so that the application fits into the memory of your standalone local PC/Mac (a rough sketch follows below)
    • For example, if your production input data has 1 million rows, which obviously cannot fit into one standalone PC/Mac's memory, you might take 100 sample rows as input to debug your application locally with pyspark_xray
  • use of a flag to auto-detect local mode: CONST_BOOL_LOCAL_MODE from pyspark_xray's const.py auto-detects whether local mode is on or off based on the current OS, with values:
    • True: if the current OS is Mac or Windows
    • False: otherwise

With this flag in your Spark code base, you can locally debug and remotely execute your Spark application using the same code base.
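A rough sketch of the sampling and local-mode-flag ideas, assuming a standard SparkSession; the flag logic mirrors the OS-based detection described above, the input path and column name are made-up placeholders, and none of this is pyspark_xray's actual API:

import platform
from pyspark.sql import SparkSession

# Mirrors the described behaviour of CONST_BOOL_LOCAL_MODE:
# local mode on Mac ("Darwin") or Windows, off otherwise.
CONST_BOOL_LOCAL_MODE = platform.system() in ("Darwin", "Windows")

spark = SparkSession.builder.appName("xray-style-demo").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical production input path

if CONST_BOOL_LOCAL_MODE:
    # Sample a small slice so the job fits into a laptop's memory while debugging.
    df = df.limit(100)

df.groupBy("event_type").count().show()  # "event_type" is a made-up column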

Comments

0

Another approach (which worked for me):

  1. Follow the remote-pdb installation and usage instructions to add the set_trace call to your code.

    • If you know a desired and available port ahead of time, I would recommend, for example:
      from remote_pdb import RemotePdb
      RemotePdb('127.0.0.1', 6543).set_trace()
      
  2. Remove any remote-debugging-related SPARK_SUBMIT_OPTS options, for example agentlib:jdwp... (see an example); these may interfere with the remote pdb.

  3. In a different terminal session, use nc or an equivalent command to connect (as described in the link, and shown below), and you will be able to use normal pdb commands afterwards.
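For example, if you used port 6543 as in the snippet above, connecting from a second terminal is a one-liner (plain nc shown here; telnet or socat can be used as well):

> nc 127.0.0.1 6543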

Comments
