Check out this tool called pyspark_xray, which enables you to step into 100% of your PySpark code using PyCharm. Below is a high-level summary extracted from its documentation.
pyspark_xray is a diagnostic tool, in the form of a Python library, for PySpark developers to debug and troubleshoot PySpark applications locally. Specifically, it enables local debugging of PySpark RDD or DataFrame transformation functions that run on slave nodes.
The purpose of developing pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally and do production runs remotely using the same code base of a PySpark application. For the local debugging part, pyspark_xray specifically provides the capability of locally debugging Spark application code that runs on slave nodes; the absence of this capability is currently an unfilled gap for Spark application developers.
Problem
For developers, it's very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.
If you develop PySpark applications, you know that PySpark application code is made up of two categories:
- code that runs on master node
- code that runs on worker/slave nodes
While code on the master node can be accessed by a debugger locally, code on slave nodes is like a black box and is not accessible locally by a debugger.
Plenty of tutorials on the web cover the steps of debugging PySpark code that runs on the master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found; most people either treat this part of the code as a black box or say there is no need to debug it.
Spark code that runs on slave nodes includes, but is not limited to, lambda functions that are passed as input parameters to RDD transformation functions.
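To make this concrete, here is a minimal, purely illustrative example (not taken from the pyspark_xray documentation) of a lambda passed to an RDD transformation; this is exactly the kind of code a local debugger cannot step into:

```python
from pyspark.sql import SparkSession

# "local[*]" is only so this snippet runs standalone; on a real cluster
# the master would be YARN, Kubernetes, etc.
spark = SparkSession.builder.master("local[*]").appName("debug-example").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

# The lambda below is serialized and executed on worker (slave) nodes,
# in separate Python worker processes. A breakpoint set inside it in
# PyCharm or VSCode is never hit, because the debugger is attached only
# to the driver (master) process.
doubled = rdd.map(lambda x: x * 2)
print(doubled.collect())
```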
Solution
The pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on the master node but also code that runs on slave nodes, using PyCharm and other popular IDEs such as VSCode.
This library achieves these capabilities by using the following techniques:
- wrapper functions for Spark code that runs on slave nodes; see that section of the documentation for more details
- the practice of sampling input data under local debugging mode so that the application fits into the memory of your standalone local PC/Mac
  - For example, say your production input data has 1 million rows, which obviously cannot fit into one standalone PC/Mac's memory; to use pyspark_xray, you may take 100 sample rows as the input to debug your application locally
- usage of a flag to auto-detect local mode: CONST_BOOL_LOCAL_MODE from pyspark_xray's const.py auto-detects whether local mode is on or off based on the current OS, with values:
  - True: if the current OS is Mac or Windows
  - False: otherwise
With this flag in your Spark code base, you can locally debug and remotely execute your Spark application using the same code base, along the lines of the sketch below.
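To show how the flag, sampling, and wrapper-function techniques fit together, here is a rough, hypothetical sketch. The names run_transform and LOCAL_MODE are stand-ins invented for this example, not pyspark_xray's actual API, and LOCAL_MODE only mirrors the OS-based behavior described above for CONST_BOOL_LOCAL_MODE:

```python
import platform

from pyspark.sql import SparkSession

# Stand-in for the described CONST_BOOL_LOCAL_MODE behavior:
# True if the current OS is Mac (Darwin) or Windows, False otherwise.
LOCAL_MODE = platform.system() in ("Darwin", "Windows")


def transform(x):
    # Business logic you want to step through line by line in your IDE.
    return x * 2


def run_transform(rdd, func, sample_size=100):
    """Hypothetical wrapper illustrating the technique, not pyspark_xray's API."""
    if LOCAL_MODE:
        # Local debug path: take a small sample, run func as plain Python on
        # the driver so breakpoints inside func are hit, then re-parallelize
        # so the caller still gets an RDD back.
        sample = [func(x) for x in rdd.take(sample_size)]
        return rdd.context.parallelize(sample)
    # Production path: the very same func is shipped to the slave nodes.
    return rdd.map(func)


if __name__ == "__main__":
    # master("local[*]") is only for trying the sketch on one machine.
    spark = SparkSession.builder.master("local[*]").appName("xray-sketch").getOrCreate()
    numbers = spark.sparkContext.parallelize(range(1000))
    print(run_transform(numbers, transform).take(10))
```

The point of the pattern is that transform stays identical in both paths; only the wrapper decides whether it runs on the driver (debuggable) or on the slave nodes (production).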