I am deploying PySpark in my AKS Kubernetes cluster, following these guides:
- https://towardsdatascience.com/ignite-the-spark-68f3f988f642
- http://blog.brainlounge.de/memoryleaks/getting-started-with-spark-on-kubernetes/
I have deployed my driver pod as explained in the links above:
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: spark
  name: my-notebook-deployment
  labels:
    app: my-notebook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-notebook
  template:
    metadata:
      labels:
        app: my-notebook
    spec:
      serviceAccountName: spark
      containers:
      - name: my-notebook
        image: pidocker-docker-registry.default.svc.cluster.local:5000/my-notebook:latest
        ports:
        - containerPort: 8888
        volumeMounts:
        - mountPath: /root/data
          name: my-notebook-pv
        workingDir: /root
        resources:
          limits:
            memory: 2Gi
      volumes:
      - name: my-notebook-pv
        persistentVolumeClaim:
          claimName: my-notebook-pvc
---
apiVersion: v1
kind: Service
metadata:
  namespace: spark
  name: my-notebook-deployment
spec:
  selector:
    app: my-notebook
  ports:
  - protocol: TCP
    port: 29413
  clusterIP: None
Then I can create the Spark cluster from the notebook with the following code:
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# Create Spark config for our Kubernetes based cluster manager
sparkConf = SparkConf()
sparkConf.setMaster("k8s://https://kubernetes.default.svc.cluster.local:443")
sparkConf.setAppName("spark")
sparkConf.set("spark.kubernetes.container.image", "<MYIMAGE>")
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.executor.instances", "7")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.driver.memory", "512m")
sparkConf.set("spark.executor.memory", "512m")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
sparkConf.set("spark.driver.port", "29413")
sparkConf.set("spark.driver.host", "my-notebook-deployment.spark.svc.cluster.local")
# Initialize our Spark cluster, this will actually
# generate the worker nodes.
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext
It works.
How can I create an external pod that executes a Python script living in my my-notebook-deployment pod? I can already do this from my terminal:
kubectl exec my-notebook-deployment-7669bb6fc-29stw -- python3 myscript.py
But I would like to automate this by running the command from inside another pod.
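To make the goal concrete, here is a minimal sketch of what I imagine the other pod running. It is not a tested solution. It assumes that `kubectl` is installed in that pod's image and that the pod's service account is allowed to get/list pods and create `pods/exec` in the `spark` namespace. The helper names (`build_lookup_cmd`, `build_exec_cmd`, `run_script_in_notebook_pod`) are hypothetical. The pod is looked up by its `app=my-notebook` label, because the generated pod name changes whenever the deployment is rolled out.

```python
import subprocess

# Values taken from the manifests above.
NAMESPACE = "spark"
LABEL_SELECTOR = "app=my-notebook"


def build_lookup_cmd(namespace, selector):
    """Build the kubectl command that prints the name of the first matching pod."""
    return [
        "kubectl", "get", "pods",
        "-n", namespace,
        "-l", selector,
        "-o", "jsonpath={.items[0].metadata.name}",
    ]


def build_exec_cmd(namespace, pod, script):
    """Build the kubectl command that runs a script with python3 inside the pod."""
    return ["kubectl", "exec", "-n", namespace, pod, "--", "python3", script]


def run_script_in_notebook_pod(script="myscript.py"):
    """Find the notebook pod by label, then run the script inside it.

    Assumes kubectl is on PATH and is authorized for this namespace.
    Raises subprocess.CalledProcessError if either kubectl call fails.
    """
    lookup = subprocess.run(
        build_lookup_cmd(NAMESPACE, LABEL_SELECTOR),
        check=True, capture_output=True, text=True,
    )
    pod = lookup.stdout.strip()
    return subprocess.run(build_exec_cmd(NAMESPACE, pod, script), check=True)
```

The other pod would then only need to call `run_script_in_notebook_pod("myscript.py")`, for example from a Job or CronJob container.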