
I'm new to Spark and have a question about the map function on a DataFrame. I have a Spark SQL DataFrame named df that looks like this:

+----------+------------+------+
|      time|         tag| value|
+----------+------------+------+
|1399766400|A00000000001|1000.0|
|1399766401|A00000000002|1001.0|
+----------+------------+------+

I can select part of them based on the tag value with the command:

temp = sqlContext.sql("SELECT * FROM df WHERE tag = 'A00000000001'")
temp.show(1)

then we have:

+----------+------------+------+
|      time|         tag| value|
+----------+------------+------+
|1399766400|A00000000001|1000.0|
+----------+------------+------+

Currently, I have a list

x = ["SELECT * FROM df WHERE tag = 'A00000000001'", "SELECT * FROM df WHERE tag = 'A00000000002'"]

which is stored as an RDD, and I would like to apply a map function over it to count the rows returned by each query. I tried:

y = x.map(lambda x: sqlContext.sql(x).count())
y.take(2)

I expected the return value to be [1, 1], but it raises the error:

TypeError: 'JavaPackage' object is not callable

Is it possible to execute a map function on a DataFrame this way? If not, how should I do it?


2 Answers


As already stated, it is not possible to execute nested operations on distributed data structures. More generally, Spark is not a database: Spark data structures, including DataFrames, are not designed for tasks like single-record retrieval.

If all the queries follow the same pattern, filtering on a single column, it is just a matter of a simple join and aggregation:

tags = sc.parallelize([("A00000000001", ), ("A00000000002", )]).toDF(["tag"])
tags.join(df, ["tag"]).groupBy("tag").count()
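To make the semantics of the snippet above concrete without a running Spark cluster, here is a plain-Python sketch of what the join followed by groupBy("tag").count() computes; the rows and names (rows, wanted_tags) mirror the example data in the question and are purely illustrative, not part of any Spark API:

```python
from collections import Counter

# Rows of df as (time, tag, value) tuples, matching the table above.
rows = [
    (1399766400, "A00000000001", 1000.0),
    (1399766401, "A00000000002", 1001.0),
]
wanted_tags = {"A00000000001", "A00000000002"}

# Inner join on tag, then count rows per tag -- in Spark this is a single
# distributed pass instead of one query per tag.
counts = Counter(tag for _, tag, _ in rows if tag in wanted_tags)
# counts == Counter({"A00000000001": 1, "A00000000002": 1})
```

The point of the Spark version is the same: one join and one aggregation replace N separate filter-and-count queries.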



It is not possible. You can use a list comprehension instead:

>>> xs = ["SELECT * FROM df WHERE tag = 'A00000000001'", "SELECT * FROM df WHERE tag = 'A00000000002'"]
>>> [sqlContext.sql(x).count() for x in xs]

2 Comments

So, if the list is in an RDD variable, do I have to collect it first? That will take a long time.
Unless your RDD is bounded by a "small" number of elements, it's recommended to avoid collecting, as it brings all the data to the driver and may blow it up with an OOM error. I would definitely consider @zero323's answer the cleaner solution.
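If the queries really do all follow the tag-filter pattern, the tag values can be pulled out of the query strings on the driver with a regular expression and then fed into the join-based approach from the other answer. A minimal sketch (the regex and variable names are illustrative assumptions, not from the question):

```python
import re

# Query strings in the shape used in the question.
queries = [
    "SELECT * FROM df WHERE tag = 'A00000000001'",
    "SELECT * FROM df WHERE tag = 'A00000000002'",
]

# Extract the quoted tag value from each WHERE clause.
# Assumes every query matches; add error handling for real input.
tag_pattern = re.compile(r"tag\s*=\s*'([^']+)'")
tags = [tag_pattern.search(q).group(1) for q in queries]
# tags == ["A00000000001", "A00000000002"]
```

The resulting tags list could then be parallelized into a one-column DataFrame and joined against df, avoiding one SQL query per tag.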
