I am getting an Invalid status code '400' error every time I try to show a pyspark dataframe. My AWS SageMaker driver and executor memory are 32G.
-Env:
Python version : 3.7.6
pyspark version : '2.4.5-amzn-0'
Notebook instance : 'ml.t2.2xlarge'
-EMR cluster config
{"classification":"livy-conf","properties":{"livy.server.session.timeout":"5h"}},
{"classification":"spark-defaults","properties":{"spark.driver.memory":"20G"}}
After some manipulation, I cleaned the data and reduced its size. The dataframe looks correct:
print((df.count(), len(df.columns)))
df.show()
(1642, 9)
stock date time spread time_diff ...
VOD 01-01 9:05 0.01 1132 ...
VOD 01-01 9:12 0.03 465 ...
VOD 01-02 10:04 0.02 245
VOD 01-02 10:15 0.01 364
VOD 01-02 10:04 0.02 12
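Before filtering, I can also check how the data is laid out; both calls below are standard PySpark API:

# Number of partitions backing the dataframe; a single large
# partition makes one executor do all the work in later steps.
print(df.rdd.getNumPartitions())

# Physical plan for the cleaned dataframe, to spot expensive
# steps (e.g. wide shuffles) that re-run on every action.
df.explain()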
However, if I then continue with a filter,
from pyspark.sql import functions as f

new_df = df.filter(f.col('time_diff') <= 1800)
new_df.show()
then I get this error:
An error was encountered:
Invalid status code '400' from http://11.146.133.8:8990/sessions/34/statements/8 with error payload: {"msg":"requirement failed: Session isn't active."}
I really have no idea what's going on.
Can someone please advise?
Thanks


1. … df?
2. How do you save your data (parquet or csv or ...)?
3. How many partitions do you have in your df?
4. Do you have any data skewness?

As you mentioned, you call some action like count() and show() and it still works at that moment but fails after further processing; I believe it relates to insufficient memory, or a single-partition transformation overloading your executor. A sketch of how to check these points follows below.
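A quick way to check points 3 and 4 from the notebook (standard PySpark calls; 'stock' as the grouping key and the output path are assumptions for illustration):

from pyspark.sql import functions as f

# Point 3: partition count; one giant partition means one executor
# carries the whole transformation.
print(df.rdd.getNumPartitions())

# Point 4: row counts per key; one dominant key indicates skew.
df.groupBy('stock').count().orderBy(f.desc('count')).show()

# If the layout looks bad, repartition before heavy work, or write
# the cleaned data out as parquet and read it back (point 2).
df = df.repartition(8, 'stock')   # 8 partitions is an arbitrary example
df.write.mode('overwrite').parquet('s3://your-bucket/cleaned/')   # hypothetical path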