
I am trying to create a PySpark script that takes the min and max dates from a table, stores them in a DataFrame, splits those two values into two variables, and then uses those variables as a time range in another query. My problem is that `dates` is a DataFrame like this:

+--------+--------+
| maxDate| minDate|
+--------+--------+
|20210701|20210629|
+--------+--------+

And I want only the values of the maxDate and minDate.

I tried `dates.iloc[0]` and `var1 = dates['maxDate'].values[0]`, but neither worked (those are pandas idioms, not Spark DataFrame ones).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from datetime import datetime


current_timestamp = datetime.strftime(datetime.now(), "%Y%m%d%H%M")

spark = SparkSession.builder.appName("testing") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("hive.exec.compress.output", "false") \
    .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false") \
    .config("spark.debug.maxToStringFields", 1000) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("set max_row_size = 6mb")
dates = spark.sql("SELECT MAX(date) AS maxDate, MIN(date) AS minDate FROM db.table")

# dates must be split here into two separate variables

result = spark.sql(
    "select * from db.table_2 where date between {} and {}".format(var1, var2)
)

1 Answer

You can do it like below: collect the single row once, then index into it.

row = dates.collect()[0]
max_date = row[0]  # value of maxDate
min_date = row[1]  # value of minDate
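Positional indexing breaks silently if the column order in the SELECT ever changes, so it is safer to pull the values out of the collected `Row` by column name. A minimal sketch of wiring the bounds into the second query (a plain dict stands in for the collected `Row` here, since `Row` supports the same name-based access, and this way the sketch runs without a Spark session):

```python
# dates.collect()[0] returns a Row; it can be indexed by column name.
# A plain dict stands in for that Row here so the sketch runs without
# a Spark session.
row = {"maxDate": "20210701", "minDate": "20210629"}

max_date = row["maxDate"]
min_date = row["minDate"]

# Build the second query from the extracted bounds; BETWEEN expects
# the lower bound first.
query = (
    "select * from db.table_2 "
    "where date between {} and {}".format(min_date, max_date)
)
print(query)
# select * from db.table_2 where date between 20210629 and 20210701
```

With a real DataFrame, `dates.collect()[0]["maxDate"]` (or `dates.first()["maxDate"]`) does the same thing without depending on column positions.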