
I am trying to create a PySpark script that takes the min and max dates from a table, stores them in a DataFrame, splits those two values into two variables, and then uses those variables as a time range in another query. My problem is that `dates` is a DataFrame like this:

+--------+--------+
| maxDate| minDate|
+--------+--------+
|20210701|20210629|
+--------+--------+

And I want only the values of the maxDate and minDate.

I tried `dates.iloc[0]` and `var1 = dates['maxDate'].values[0]`, but neither worked (those are pandas idioms, not Spark DataFrame ones).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from datetime import datetime


current_timestamp = datetime.strftime(datetime.now(), "%Y%m%d%H%M")

spark = SparkSession.builder.appName("testing") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("hive.exec.compress.output", "false") \
    .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false") \
    .config("spark.debug.maxToStringFields", 1000) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("set max_row_size = 6mb")
dates = spark.sql("SELECT MAX(date) AS maxDate, MIN(date) AS minDate FROM db.table")

# dates must be split here into two separate variables

result = spark.sql(
    "select * from db.table_2 where date between {} and {}".format(var1, var2)
)

1 Answer

You can do it like below: collect the single row once, then index into it.

row = dates.collect()[0]
max_date = row[0]  # value of maxDate
min_date = row[1]  # value of minDate
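Positional indexing breaks silently if the column order in the SELECT ever changes, so it is safer to pull the values out of the collected `Row` by column name. A minimal sketch of wiring the bounds into the second query (a plain dict stands in for the collected `Row` here, since `Row` supports the same name-based access, and this way the sketch runs without a Spark session):

```python
# dates.collect()[0] returns a Row; it can be indexed by column name.
# A plain dict stands in for that Row here so the sketch runs without
# a Spark session.
row = {"maxDate": "20210701", "minDate": "20210629"}

max_date = row["maxDate"]
min_date = row["minDate"]

# Build the second query from the extracted bounds; BETWEEN expects
# the lower bound first.
query = (
    "select * from db.table_2 "
    "where date between {} and {}".format(min_date, max_date)
)
print(query)
# select * from db.table_2 where date between 20210629 and 20210701
```

With a real DataFrame, `dates.collect()[0]["maxDate"]` (or `dates.first()["maxDate"]`) does the same thing without depending on column positions.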