
I am executing a Python script file with PySpark 1.6.2 (yes, an old version, for certification training reasons).

spark-submit --master yarn-cluster s01.py

When run, it just keeps printing "Application report for application_somelongnumber". I was expecting it to show the output of my script so that I can check whether I developed it correctly. What should I do differently to get what I want?

The content of my script:

#!/usr/bin/python

from pyspark.sql import Row
from pyspark.sql.functions import *
from pyspark import SparkContext
sc = SparkContext(appName="solution01")

a = sc.textFile("/data/crime.csv")  # RDD of the file's lines
b = a.take(1)                       # first line, brought back to the driver as a local list
sc.stop()
print(b)                            # b is a plain Python list, so printing after stop() works

UPDATE: When I execute pyspark s01.py I see my results, but that is not the intended behaviour, because I want the script to be executed with parameters on the cluster.

1 Answer

1) Print statements will not show on your console in yarn-cluster mode, because the driver runs on the cluster rather than on your machine. To print an RDD's contents from the driver, collect it and iterate over the result (in PySpark, collect() returns a plain Python list):

for record in myRDD.collect():
    print(record)
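
A minimal sketch of the question's script with this applied (same file path and app name as above; run it in yarn-client mode so the output reaches your terminal):

#!/usr/bin/python
from pyspark import SparkContext

sc = SparkContext(appName="solution01")
a = sc.textFile("/data/crime.csv")
for line in a.take(1):  # take() returns the first line(s) as a local Python list
    print(line)         # visible on the console in yarn-client mode
sc.stop()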

2) While debugging, use yarn-client mode instead of yarn-cluster; the Spark driver then runs on the machine from which you execute the spark-submit command, so its output appears on your console.
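
For example, with the script from the question (in Spark 1.6, client mode is selected via the yarn-client master value):

spark-submit --master yarn-client s01.py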

3) When you execute a Spark application in yarn-cluster mode, the logs cannot be seen on the console while it runs. The application report includes a tracking URL with the application id; you can check the logs at that URL.

Alternatively, once execution has completed, you can download the logs from the cluster to your local machine with the command:

yarn logs -applicationId <application>
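
For example, once the job has finished (the application id below is made up for illustration; use the one from your application report), you can save the logs to a local file:

yarn logs -applicationId application_1468912345678_0001 > s01_logs.txt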
