I am using PySpark with Flask in order to have a web service.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from flask import Flask, jsonify
from pyspark import SparkFiles
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType

app = Flask(__name__)

# DEFINE SPARK SESSION
spark = SparkSession \
    .builder \
    .appName("app") \
    .master("<master>") \
    .config("spark.cores.max", 4) \
    .config("spark.executor.memory", "6g") \
    .getOrCreate()

# LOAD THE REQUIRED FILES
hdfsUrl = "<hdfs-url>"  # base HDFS path where the models and parquet files are stored

modelEnglish = PipelineModel.load(hdfsUrl + "model-english")
ticketEnglishDf = spark.read.parquet(hdfsUrl + "ticket-df-english.parquet").cache()

modelFrench = PipelineModel.load(hdfsUrl + "model-french")
ticketFrenchDf = spark.read.parquet(hdfsUrl + "ticket-df-french.parquet").cache()

def getSuggestions(ticketId, top = 10):
    # ...

    return ...

@app.route("/suggest/<int:ticketId>")
def suggest(ticketId):
    response = {"id": ticketId, "suggestions": getSuggestions(ticketId)}
    return jsonify(response)

if __name__ == "__main__":
    app.run(debug=True, host="127.0.0.1", port=2793, threaded=True)

This works well when a request is sent to the server, but the Spark jobs are duplicated... and I don't know why.

I have already tried to create the Spark session inside the if __name__ == "__main__": block.

1 Answer

Spark uses RDDs, which are lazy collections. By calling RDD/DataFrame methods you are actually assembling a transformation pipeline. Computation is only triggered once you run an action, like collect, count or write. Normally (unless you cache a collection) it will be recalculated over and over. However, caching doesn't guarantee that the collection won't be recomputed. See the documentation on RDDs and caching.
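
For example, here is a minimal sketch of this behavior (the parquet path and the lang column are hypothetical; spark is the session from your question):

# Transformations only build a plan; nothing runs until an action is called.
df = spark.read.parquet("hdfs:///tmp/events.parquet")   # no job yet
filtered = df.filter(df["lang"] == "en")                # still lazy

filtered.count()     # action: a job reads the file and filters it
filtered.count()     # without cache(), the same pipeline runs again from scratch

filtered.cache()
filtered.count()     # computes and keeps the partitions in memory
filtered.count()     # usually served from the cache; evicted partitions are recomputed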

Using Spark in a server application is horribly wrong in the first place. It is a distributed data processing platform that should be used for batch jobs or streaming. Spark jobs normally write to a file or a database, and processing takes several hours (or even days) on multiple machines to finish.

I suppose your model is the output of a Spark ML pipeline. It should be small enough to bundle with your server application and load with the usual file IO tools.
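
A minimal sketch of that idea, assuming the saved model directory has been copied next to the application (the local ./model-english path is hypothetical): a local-mode Spark session is enough to deserialize the pipeline once at startup, without submitting anything to a cluster.

from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

# Single-process, local-mode session used only to load and apply the saved pipeline.
spark = SparkSession.builder.master("local[1]").appName("model-serving").getOrCreate()

# Hypothetical local path where the model directory was copied from HDFS.
modelEnglish = PipelineModel.load("./model-english")

Whether this is enough depends on the stages in your pipeline, but it avoids sending a cluster job for every HTTP request.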


1 Comment

For the application I now avoid the use of Flask. I looked for a streaming system for a long time... and now I am using Kafka, and it rocks! :)
