
I am fetching data from a MySQL table using PySpark, like below.

df = (sqlContext.read.format("jdbc")
    .option("url", "{}:{}/{}".format(domain, port, mysqldb))
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "(select ifnull(max(id),0) as maxval, ifnull(min(id),0) as minval, ifnull(min(test_time),'1900-01-01 00:00:00') as mintime, ifnull(max(test_time),'1900-01-01 00:00:00') as maxtime FROM `{}`) as `{}`".format(table, table))
    .option("user", mysql_user)
    .option("password", password)
    .load())

The result of df.show() is below

+------+------+-------------------+-------------------+
|maxval|minval|            mintime|            maxtime|
+------+------+-------------------+-------------------+
|  1721|     1|2017-03-09 22:15:49|2017-12-14 05:17:04|
+------+------+-------------------+-------------------+

Now I want to get each column and its value separately.

I want to get

max_val = 1721
min_val = 1
min_time = 2017-03-09 22:15:49
max_time = 2017-12-14 05:17:04

I have done like below.

 max_val = df.select('maxval').collect()[0].asDict()['maxval']
 min_val = df.select('minval').collect()[0].asDict()['minval']
 max_time = df.select('maxtime').collect()[0].asDict()['maxtime']
 min_time = df.select('mintime').collect()[0].asDict()['mintime']

Is there a better way to do this in PySpark?

1 Answer

Currently you are using collect four times, which is costly: each call triggers a separate Spark job. You can use a little plain Python to avoid this. Here is one approach you can try:

df = (sqlContext.read.format("jdbc")
    .option("url", "{}:{}/{}".format(domain,port,mysqldb))
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", """(
        select ifnull(max(id),0) as maxval, ifnull(min(id),0) as minval, 
               ifnull(min(test_time),'1900-01-01 00:00:00') as mintime, 
               ifnull(max(test_time), '1900-01-01 00:00:00') as maxtime 
         FROM `{}`) as `{}`""".format(table, table))
    .option("user", "{}".format(mysql_user))
    .option("password", "{}".format(password)).load())


# One collect via first(); bind each column name as a variable in the global scope.
for key, value in df.first().asDict().items():
    globals()[key] = value

print(minval)
print(maxval)
print(mintime)
print(maxtime)

In this way you can convert columns to variables. In case you need further assistance, let me know.
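A variation that avoids writing into globals(), which can be hard to debug: unpack the single row into explicitly named variables. This is a minimal sketch; a plain dict stands in for the df.first().asDict() result, since the real values would come from the JDBC query above.

```python
# Stand-in for df.first().asDict() -- in real PySpark this dict would come
# from the single-row aggregate DataFrame built earlier.
row = {
    "maxval": 1721,
    "minval": 1,
    "mintime": "2017-03-09 22:15:49",
    "maxtime": "2017-12-14 05:17:04",
}

# Still only one collect (via first()), but the names are explicit,
# so readers and linters can see where each variable comes from.
max_val = row["maxval"]
min_val = row["minval"]
min_time = row["mintime"]
max_time = row["maxtime"]

print(max_val, min_val, min_time, max_time)
```

Explicit assignment keeps the variables local to the current scope instead of silently injecting names whose origin is invisible at the call site.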


4 Comments

I am getting the following error Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: too many values to unpack
@Question_bank just a small mistake: I forgot to add .items() after .asDict(). I have modified the answer; you can check now.
@RakeshKumar What if df.show() has more than one result?
@AlexRajKaliamoorthy in that case it is like a 2-dimensional array: collect() gives you a list with one Row per result row.
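To illustrate the multi-row case raised in the comments: collect() on a DataFrame with several rows returns a list of Row objects, and calling .asDict() on each yields a list of dicts. A minimal sketch, with plain dicts standing in for Rows and made-up values:

```python
# Stand-in for [r.asDict() for r in df.collect()] on a multi-row result;
# the values here are invented for illustration.
rows = [
    {"maxval": 1721, "minval": 1},
    {"maxval": 3500, "minval": 2},
]

# Pull one column across all rows, e.g. every maxval:
maxvals = [r["maxval"] for r in rows]
print(maxvals)
```

With multiple rows, assigning one variable per column no longer makes sense; work with the list (or keep the data in the DataFrame) instead.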
