I have code that reads multiple tables (more than 10) into separate DataFrames in PySpark via JDBC. I would like to replace this repetitive code with a for loop that keeps a reference to each DataFrame. My code is as follows:
Features_PM = (spark.read
    .jdbc(url=jdbcUrl, table='Features_PM',
          properties=connectionProperties))
Features_CM = (spark.read
    .jdbc(url=jdbcUrl, table='Features_CM',
          properties=connectionProperties))
I tried something like this, but it didn't work (jdbcDF is reassigned on every iteration, so only the last table's DataFrame is left at the end, and I still don't have a separate reference per table):
table_list = ['table1', 'table2', 'table3', 'table4']

for table in table_list:
    jdbcDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()
Source for the above snippet: https://community.cloudera.com/t5/Support-Questions/read-multiple-table-parallel-using-Spark/td-p/286498
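To show what I have in mind, here is a rough sketch of the kind of loop I'm after, storing each DataFrame in a dict keyed by table name instead of separate variables (this reuses jdbcUrl and connectionProperties from my first snippet; the dict name dataframes is just my own placeholder). Is something like this the idiomatic approach?

# Sketch: read every table in one loop and keep a reference to each
# DataFrame in a dict keyed by table name.
# Assumes spark, jdbcUrl, and connectionProperties are defined as above.
table_list = ['Features_PM', 'Features_CM']

dataframes = {}
for table in table_list:
    dataframes[table] = (spark.read
        .jdbc(url=jdbcUrl, table=table,
              properties=connectionProperties))

# Then each table would be accessible by name, e.g.:
# dataframes['Features_PM'].show()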
Any help would be appreciated. Thanks!