
I am fetching data from a MySQL table using PySpark, like below.

df = (sqlContext.read.format("jdbc")
    .option("url", "{}:{}/{}".format(domain, port, mysqldb))
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "(select ifnull(max(id),0) as maxval, ifnull(min(id),0) as minval, ifnull(min(test_time),'1900-01-01 00:00:00') as mintime, ifnull(max(test_time),'1900-01-01 00:00:00') as maxtime FROM `{}`) as `{}`".format(table, table))
    .option("user", mysql_user)
    .option("password", password)
    .load())

The result of df.show() is below

+------+------+-------------------+-------------------+
|maxval|minval|            mintime|            maxtime|
+------+------+-------------------+-------------------+
|  1721|     1|2017-03-09 22:15:49|2017-12-14 05:17:04|
+------+------+-------------------+-------------------+

Now I want to get each column and its value separately.

I want to get

max_val = 1721
min_val = 1
min_time = 2017-03-09 22:15:49
max_time = 2017-12-14 05:17:04

I have done like below.

 max_val = df.select('maxval').collect()[0].asDict()['maxval']
 min_val = df.select('minval').collect()[0].asDict()['minval']
 max_time = df.select('maxtime').collect()[0].asDict()['maxtime']
 min_time = df.select('mintime').collect()[0].asDict()['mintime']

Is there a better way to do this in PySpark?

1 Answer

Currently you are using collect four times, which is costly: each call triggers a separate Spark job. You can use a little plain Python to avoid this. Here is one approach you can try:

df = (sqlContext.read.format("jdbc")
    .option("url", "{}:{}/{}".format(domain,port,mysqldb))
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", """(
        select ifnull(max(id),0) as maxval, ifnull(min(id),0) as minval, 
               ifnull(min(test_time),'1900-01-01 00:00:00') as mintime, 
               ifnull(max(test_time), '1900-01-01 00:00:00') as maxtime 
         FROM `{}`) as `{}`""".format(table, table))
    .option("user", "{}".format(mysql_user))
    .option("password", "{}".format(password)).load())


# One collect via first(); bind each column name as a variable in the global scope.
for key, value in df.first().asDict().items():
    globals()[key] = value

print(minval)
print(maxval)
print(mintime)
print(maxtime)

In this way you can convert columns to variables. In case you need further assistance, let me know.
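A variation that avoids writing into globals(), which can be hard to debug: unpack the single row into explicitly named variables. This is a minimal sketch; a plain dict stands in for the df.first().asDict() result, since the real values would come from the JDBC query above.

```python
# Stand-in for df.first().asDict() -- in real PySpark this dict would come
# from the single-row aggregate DataFrame built earlier.
row = {
    "maxval": 1721,
    "minval": 1,
    "mintime": "2017-03-09 22:15:49",
    "maxtime": "2017-12-14 05:17:04",
}

# Still only one collect (via first()), but the names are explicit,
# so readers and linters can see where each variable comes from.
max_val = row["maxval"]
min_val = row["minval"]
min_time = row["mintime"]
max_time = row["maxtime"]

print(max_val, min_val, min_time, max_time)
```

Explicit assignment keeps the variables local to the current scope instead of silently injecting names whose origin is invisible at the call site.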


4 Comments

I am getting the following error Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: too many values to unpack
@Question_bank just a small mistake: I forgot to add .items() after .asDict(). I have modified the answer; you can check now.
@RakeshKumar What if df.show() has more than one result?
@AlexRajKaliamoorthy in that case it is like a 2-dimensional array: collect() gives you a list with one Row per result row.
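To illustrate the multi-row case raised in the comments: collect() on a DataFrame with several rows returns a list of Row objects, and calling .asDict() on each yields a list of dicts. A minimal sketch, with plain dicts standing in for Rows and made-up values:

```python
# Stand-in for [r.asDict() for r in df.collect()] on a multi-row result;
# the values here are invented for illustration.
rows = [
    {"maxval": 1721, "minval": 1},
    {"maxval": 3500, "minval": 2},
]

# Pull one column across all rows, e.g. every maxval:
maxvals = [r["maxval"] for r in rows]
print(maxvals)
```

With multiple rows, assigning one variable per column no longer makes sense; work with the list (or keep the data in the DataFrame) instead.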
