
I have a CSV file in the following format:

key_string,query
abc,"select * from abc"
pqr,"select * from pqr"
xyz,"select * from xyz"

These tables are in Hive. I want to create one dataframe per row, e.g. abc_df, pqr_df, and so on. I may add more queries to the CSV in the future. How can I create multiple dataframes in PySpark using a for loop or any other technique? I tried the following code, but it's not working (df is the dataframe I read from the above CSV file):

x=""
y=[]
for i in df.rdd.collect():
    x= i[0] + "_df"
    x = spark.sql(i[1])
    y.append(x)
print(y)

Please suggest next steps.

  • What do you mean by it’s not working? What is your expected outcome, and what did you obtain from your code? Commented Dec 15, 2020 at 6:04
  • @mck I just want to create dataframes from the queries in the CSV file, with key_string_df as the name of each dataframe. Commented Dec 15, 2020 at 7:04
  • It's a bad idea to use variables as variable names. This is what a dictionary is built for. Do you want a dictionary instead, like {'key_string_df': dataframe, ...}? Commented Dec 15, 2020 at 7:06

1 Answer


I'd suggest using a dictionary for this purpose:

y = dict()
for i in df.rdd.collect():
    y[i[0] + "_df"] = spark.sql(i[1])

If you want to get the dataframes, you can use, for example,

y['abc_df'].show()
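A slightly fuller sketch of the same dictionary idea, in case it helps. The helper name `build_dataframes` and its `run_query` parameter are hypothetical, not part of PySpark. The sample CSV text and the stand-in lambda are only there so the snippet runs without a Spark session. With a real session you would pass `spark.sql` as `run_query`.

```python
import csv
import io


def build_dataframes(rows, run_query):
    """Map each key_string to the result of running its query.

    rows      -- iterable of (key_string, query) pairs
    run_query -- callable that executes one query, e.g. spark.sql
    """
    frames = {}
    for key, query in rows:
        frames[key + "_df"] = run_query(query)
    return frames


# Stand-in for the CSV file from the question (assumption: it is small
# enough to parse on the driver with the standard csv module).
sample_csv = io.StringIO(
    'key_string,query\n'
    'abc,"select * from abc"\n'
    'pqr,"select * from pqr"\n'
)
reader = csv.reader(sample_csv)
next(reader)  # skip the header row
rows = [(key, query) for key, query in reader]

# With a live session this would be: build_dataframes(rows, spark.sql)
frames = build_dataframes(rows, run_query=lambda q: "result of: " + q)
print(sorted(frames))
```

Because the names live in dictionary keys, adding a row to the CSV adds an entry without any code change.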

2 Comments

Thanks @mck, that helped a lot. One more thing: how can I directly use abc_df.show() instead of y['abc_df'].show()?
I don't advise using variables as variable names. See my comment on your question. That's why I suggested using a dictionary. Is there a problem with using a dictionary?
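If attribute-style access is the main goal, one possible middle ground (not something proposed in the thread) is to wrap the dictionary in a `types.SimpleNamespace` from the standard library. This is a minimal sketch: the strings below are placeholders for the real DataFrames, and it assumes every key is a valid Python identifier.

```python
from types import SimpleNamespace

# Stand-in for the dictionary y built in the answer; the strings are
# placeholders for the DataFrames that spark.sql would return.
y = {"abc_df": "DataFrame for abc", "pqr_df": "DataFrame for pqr"}

# Expose each dictionary key as an attribute on a single object.
dfs = SimpleNamespace(**y)

print(dfs.abc_df)  # same object as y["abc_df"]
```

With real DataFrames this would read `dfs.abc_df.show()`, and no names are injected into the global scope.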
