
I am getting Invalid status code '400' errors every time I try to show a PySpark dataframe. My AWS SageMaker driver and executor memory are 32 GB.

-Env:

Python version : 3.7.6
pyspark version : '2.4.5-amzn-0'
Notebook instance : 'ml.t2.2xlarge'

-EMR cluster config

{"classification":"livy-conf","properties":{"livy.server.session.timeout":"5h"}},
{"classification":"spark-defaults","properties":{"spark.driver.memory":"20G"}}

After some manipulation, I cleaned the data and reduced its size. The dataframe should be correct:

print(df.count(), len(df.columns))
df.show()
(1642, 9)

 stock     date     time   spread  time_diff    ...
  VOD      01-01    9:05    0.01     1132       ...
  VOD      01-01    9:12    0.03     465        ...
  VOD      01-02   10:04    0.02     245
  VOD      01-02   10:15    0.01     364     
  VOD      01-02   10:04    0.02     12

However, if I continue filtering,

from pyspark.sql import functions as f

new_df = df.filter(f.col('time_diff') <= 1800)
new_df.show()

then I get this error:

An error was encountered:
Invalid status code '400' from http://11.146.133.8:8990/sessions/34/statements/8 with error payload: {"msg":"requirement failed: Session isn't active."}

I really have no idea what's going on.

Can someone please advise?

Thanks

  • It looks like your session has timed out, and there are lots of reasons that can cause a timeout. Although it's about EMR, this post might help you: stackoverflow.com/questions/58062824/… Commented Aug 9, 2022 at 1:25
  • Thanks @Jonathan. I followed those posts as suggested and updated the Livy timeout and driver memory, but the issue still exists. Commented Aug 9, 2022 at 9:11
  • Hi @FlyUFalcon, could you share more about: 1. the original size of your df; 2. how you save your data (parquet, csv, ...); 3. how many partitions you have in your df (see the snippet after these comments); 4. whether you have any data skew? As you mentioned, you call some actions like count() and show() and they still work at this point but fail after further processing, so I believe it relates to insufficient memory or a single-partition transformation overloading your executor. Commented Aug 9, 2022 at 11:29
  • Hi @Jonathan, the dataframe shape is (1642, 9). After I converted it to pandas, the memory usage is 109.2+ KB. Thx. Commented Aug 9, 2022 at 12:14
  • Hi @FlyUFalcon, is 109.2+ KB your source data size or the size after transformation? How do you save your source data, and how many partitions do you have when you read the dataset? Commented Aug 12, 2022 at 2:02
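
For reference, the partition count asked about in the comments can be checked like this (a small sketch, where df is the question's dataframe):

# number of partitions backing the dataframe
print(df.rdd.getNumPartitions())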

3 Answers


I haven't seen this error before, but since you mentioned that you have only 1 partition and you got this error mid-process rather than at the beginning, I believe it relates to an OOM issue.

Please try to repartition based on the total number of cores you use:

# read the data; let's say you are reading a parquet file and have 20 cores in total
df = spark.read.parquet("/path/of/your/data")
df = df.repartition(20)

Also, if your dataframe will be reused, you should call df.persist(), as in the sketch below.
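
For illustration, here is what both suggestions could look like together, reusing the question's time_diff filter (a sketch, not the asker's exact code; the 20 partitions assume the 20-core example above):

from pyspark.sql import functions as f

# spread the data across the available cores, then cache it for reuse
df = df.repartition(20).persist()

# the filter from the question now runs against the cached, repartitioned data
new_df = df.filter(f.col('time_diff') <= 1800)
new_df.show()

# free the cached data once df is no longer needed
df.unpersist()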




You need to change the livy.server.session.timeout parameter. See the answers here or here.

  • Thanks, yes. I already did that in my cluster config.

After days of searching, I finally found the answer to this question. I don't know what was wrong with my config setting, but I needed to update the driver memory from the Spark session directly.

Just increase the memory from there, and the issue will be gone.
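
For reference, in a sparkmagic-backed SageMaker notebook this kind of per-session memory change is usually made with the %%configure cell magic before any Spark code runs (a sketch; the 20G values are illustrative, not taken from the original setup):

%%configure -f
{"driverMemory": "20G", "executorMemory": "20G"}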


