I have code that reads multiple tables (more than 10) into separate DataFrames in PySpark via JDBC. I would like to replace this repetitive code with a for loop that keeps a reference to each DataFrame. My code is as follows:
Features_PM = (spark.read
    .jdbc(url=jdbcUrl, table='Features_PM',
          properties=connectionProperties))
Features_CM = (spark.read
    .jdbc(url=jdbcUrl, table='Features_CM',
          properties=connectionProperties))
I tried something like this, but it didn't work (jdbcDF is reassigned on every iteration, so only the last table's DataFrame is left at the end, and I still don't have a separate reference per table):
table_list = ['table1', 'table2', 'table3', 'table4']

for table in table_list:
    jdbcDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()
Source for the above snippet: https://community.cloudera.com/t5/Support-Questions/read-multiple-table-parallel-using-Spark/td-p/286498
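To show what I have in mind, here is a rough sketch of the kind of loop I'm after, storing each DataFrame in a dict keyed by table name instead of separate variables (this reuses jdbcUrl and connectionProperties from my first snippet; the dict name dataframes is just my own placeholder). Is something like this the idiomatic approach?

# Sketch: read every table in one loop and keep a reference to each
# DataFrame in a dict keyed by table name.
# Assumes spark, jdbcUrl, and connectionProperties are defined as above.
table_list = ['Features_PM', 'Features_CM']

dataframes = {}
for table in table_list:
    dataframes[table] = (spark.read
        .jdbc(url=jdbcUrl, table=table,
              properties=connectionProperties))

# Then each table would be accessible by name, e.g.:
# dataframes['Features_PM'].show()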
Any help would be appreciated. Thanks!