
I'm new to Spark and have a question about the map function on a DataFrame. I have a Spark SQL DataFrame named df that looks like this:

+----------+------------+------+
|      time|         tag| value|
+----------+------------+------+
|1399766400|A00000000001|1000.0|
|1399766401|A00000000002|1001.0|
+----------+------------+------+

I can select part of them based on the tag value with the command:

temp = sqlContext.sql("SELECT * FROM df WHERE tag = 'A00000000001'")
temp.show(1)

then we have:

+----------+------------+------+
|      time|         tag| value|
+----------+------------+------+
|1399766400|A00000000001|1000.0|
+----------+------------+------+

Currently, I have a list

x = ["SELECT * FROM df WHERE tag = 'A00000000001'", "SELECT * FROM df WHERE tag = 'A00000000002'"]

which is stored as an RDD, and I would like to apply a map function over it to count the rows returned by each query. I tried:

y = x.map(lambda x: sqlContext.sql(x).count())
y.take(2)

I expected the return value to be [1, 1], but it raises the error:

TypeError: 'JavaPackage' object is not callable

Is it possible to execute a map function on a DataFrame this way? If not, how should I do it?


2 Answers


As already stated, it is not possible to execute nested operations on distributed data structures. More generally, Spark is not a database: Spark data structures, including DataFrames, are not designed for tasks like single-record retrieval.

If all the queries follow the same pattern, filtering on a single column, it is just a matter of a simple join and aggregation:

tags = sc.parallelize([("A00000000001", ), ("A00000000002", )]).toDF(["tag"])
tags.join(df, ["tag"]).groupBy("tag").count()
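To make the semantics of the snippet above concrete without a running Spark cluster, here is a plain-Python sketch of what the join followed by groupBy("tag").count() computes; the rows and names (rows, wanted_tags) mirror the example data in the question and are purely illustrative, not part of any Spark API:

```python
from collections import Counter

# Rows of df as (time, tag, value) tuples, matching the table above.
rows = [
    (1399766400, "A00000000001", 1000.0),
    (1399766401, "A00000000002", 1001.0),
]
wanted_tags = {"A00000000001", "A00000000002"}

# Inner join on tag, then count rows per tag -- in Spark this is a single
# distributed pass instead of one query per tag.
counts = Counter(tag for _, tag, _ in rows if tag in wanted_tags)
# counts == Counter({"A00000000001": 1, "A00000000002": 1})
```

The point of the Spark version is the same: one join and one aggregation replace N separate filter-and-count queries.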



It is not possible. You can use a list comprehension instead:

>>> xs = ["SELECT * FROM df WHERE tag = 'A00000000001'", "SELECT * FROM df WHERE tag = 'A00000000002'"]
>>> [sqlContext.sql(x).count() for x in xs]

2 Comments

So, if the list is in an RDD variable, do I have to collect it first? That will take a long time.
Unless your RDD is bounded by a "small" number of elements, it's recommended to avoid collecting, as it brings all the data to the driver and may blow it up with an OOM error. I would definitely consider @zero323's answer the cleaner solution.
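If the queries really do all follow the tag-filter pattern, the tag values can be pulled out of the query strings on the driver with a regular expression and then fed into the join-based approach from the other answer. A minimal sketch (the regex and variable names are illustrative assumptions, not from the question):

```python
import re

# Query strings in the shape used in the question.
queries = [
    "SELECT * FROM df WHERE tag = 'A00000000001'",
    "SELECT * FROM df WHERE tag = 'A00000000002'",
]

# Extract the quoted tag value from each WHERE clause.
# Assumes every query matches; add error handling for real input.
tag_pattern = re.compile(r"tag\s*=\s*'([^']+)'")
tags = [tag_pattern.search(q).group(1) for q in queries]
# tags == ["A00000000001", "A00000000002"]
```

The resulting tags list could then be parallelized into a one-column DataFrame and joined against df, avoiding one SQL query per tag.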
