
I am getting Invalid status code '400' errors every time I try to show a PySpark dataframe. My AWS SageMaker driver and executor memory are 32 GB.

-Env:

Python version : 3.7.6
pyspark version : '2.4.5-amzn-0'
Notebook instance : 'ml.t2.2xlarge'

-EMR cluster config

{"classification":"livy-conf","properties":{"livy.server.session.timeout":"5h"}},
{"classification":"spark-defaults","properties":{"spark.driver.memory":"20G"}}

After some manipulation, I cleaned the data and reduced its size. The dataframe should be correct:

print(df.count(), len(df.columns))
df.show()
(1642, 9)

 stock     date     time   spread  time_diff    ...
  VOD      01-01    9:05    0.01     1132       ...
  VOD      01-01    9:12    0.03     465        ...
  VOD      01-02   10:04    0.02     245
  VOD      01-02   10:15    0.01     364     
  VOD      01-02   10:04    0.02     12

However, if I continue filtering,

from pyspark.sql import functions as f

new_df = df.filter(f.col('time_diff') <= 1800)
new_df.show()

then I get this error:

An error was encountered:
Invalid status code '400' from http://11.146.133.8:8990/sessions/34/statements/8 with error payload: {"msg":"requirement failed: Session isn't active."}

I really have no idea what's going on.

Can someone please advise?

Thanks

  • It looks like your session has timed out, and there are lots of reasons that can cause a timeout. Although it's about EMR, this post might help you: stackoverflow.com/questions/58062824/… Commented Aug 9, 2022 at 1:25
  • Thanks @Jonathan. I followed those posts as suggested and updated the Livy timeout and driver memory, but the issue still exists. Commented Aug 9, 2022 at 9:11
  • Hi @FlyUFalcon, could you share more about: 1. the original size of your df; 2. how you save your data (parquet, csv, ...); 3. how many partitions you have in your df (see the snippet after these comments); 4. whether you have any data skew? As you mentioned, you call some actions like count() and show() and they still work at this point but fail after further processing, so I believe it relates to insufficient memory or a single-partition transformation overloading your executor. Commented Aug 9, 2022 at 11:29
  • Hi @Jonathan, the dataframe shape is (1642, 9). After I converted it to pandas, the memory usage is 109.2+ KB. Thx. Commented Aug 9, 2022 at 12:14
  • Hi @FlyUFalcon, is 109.2+ KB your source data size or the size after transformation? How do you save your source data, and how many partitions do you have when you read the dataset? Commented Aug 12, 2022 at 2:02
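
For reference, the partition count asked about in the comments can be checked like this (a small sketch, where df is the question's dataframe):

# number of partitions backing the dataframe
print(df.rdd.getNumPartitions())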

3 Answers


I haven't seen this error before, but since you mentioned that you have only 1 partition and you got this error mid-process rather than at the beginning, I believe it relates to an OOM issue.

Please try to repartition based on the total number of cores you use:

# read the data; let's say you are reading a parquet file and have 20 cores in total
df = spark.read.parquet("/path/of/your/data")
df = df.repartition(20)

Also, if your dataframe will be reused, you should call df.persist(), as in the sketch below.
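
For illustration, here is what both suggestions could look like together, reusing the question's time_diff filter (a sketch, not the asker's exact code; the 20 partitions assume the 20-core example above):

from pyspark.sql import functions as f

# spread the data across the available cores, then cache it for reuse
df = df.repartition(20).persist()

# the filter from the question now runs against the cached, repartitioned data
new_df = df.filter(f.col('time_diff') <= 1800)
new_df.show()

# free the cached data once df is no longer needed
df.unpersist()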




You need to change the livy.server.session.timeout parameter. See the answers here or here.

  • Thanks, yes. I already did that in my cluster config.

After days of searching, I finally found the answer to this question. I don't know what was wrong with my config setting, but I needed to update the driver memory from the Spark session directly.

Just increase the memory from there, and the issue will be gone.
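
For reference, in a sparkmagic-backed SageMaker notebook this kind of per-session memory change is usually made with the %%configure cell magic before any Spark code runs (a sketch; the 20G values are illustrative, not taken from the original setup):

%%configure -f
{"driverMemory": "20G", "executorMemory": "20G"}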


