I am trying to send the result of a SQL query to a for loop. I am new to Spark and Python; please help.

    # Set up a SparkContext and a HiveContext for running SQL against Hive tables.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hive_context = HiveContext(sc)

    # bank = hive_context.table("cip_utilities.file_upload_temp")
    data = hive_context.sql("select * from cip_utilities.cdm_variable_dict")

    # Register the table's schema as a temp table so it can itself be queried.
    hive_context.sql("describe cip_utilities.cdm_variable_dict").registerTempTable("schema_def")
    temp_data = hive_context.sql("select * from schema_def")
    temp_data.show()

    # Names of the non-string columns.
    data1 = hive_context.sql("select col_name from schema_def where data_type<>'string'")
    data1.show()

2 Answers

  • Use the DataFrame.collect() method, which gathers the results of the Spark SQL query from all executors into the driver.

  • collect() returns a Python list, each element of which is a Spark Row.

  • You can then iterate over this list in a for loop, as shown below.


Code snippet:

    data1 = hive_context.sql("select col_name from schema_def where data_type<>'string'")
    column_names_as_python_list_of_rows = data1.collect()
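Each element of that list is a Row, which exposes the selected columns as attributes, so the loop body can read col_name directly:

    for row in column_names_as_python_list_of_rows:
        print(row.col_name)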



I think you need to ask yourself why you want to iterate over the data.

Are you doing an aggregation? Transforming the data? If so, consider doing it using the Spark API.

Printing some text? If so, use .collect() to retrieve the data back to your driver process. Then you can loop over the result in the usual Python way.

2 Comments

Yes, I am trying to find the maximum, minimum, and standard deviation. That's why I need to send each column name in an iteration.
You should be using the built-in Spark functions to do that - it will be far more performant. spark.apache.org/docs/2.2.0/api/python/…
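A minimal sketch of that suggestion, assuming the hive_context, the registered schema_def temp table, and the data DataFrame from the question: build one aggregate expression per non-string column and let Spark compute them all in a single pass, rather than looping in Python.

    from pyspark.sql import functions as F

    # Column names come from the schema_def temp table registered in the question.
    numeric_cols = [
        row.col_name
        for row in hive_context.sql(
            "select col_name from schema_def where data_type<>'string'"
        ).collect()
    ]

    # One max/min/stddev expression per column; Spark evaluates all of them
    # in a single job instead of one query per column.
    aggs = []
    for c in numeric_cols:
        aggs += [F.max(c).alias("max_" + c),
                 F.min(c).alias("min_" + c),
                 F.stddev(c).alias("stddev_" + c)]

    data.agg(*aggs).show()

For a quick summary, data.describe().show() also reports count, mean, stddev, min, and max for the numeric columns in one call.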
