
Question

Is there a way to load a specific column from a (PostgreSQL) database table as a Spark DataFrame?

Below is what I've tried.

Expected behavior:

The code below should result in only the specified column being stored in memory, not the entire table (table is too large for my cluster).

# psycopg2 (imported as p2) fetches the column names; Spark reads the data
import psycopg2 as p2

# make connection in order to get column names
conn = p2.connect(database=database, user=user, password=password, host=host, port="5432")
cursor = conn.cursor()
# parameterized query instead of string formatting, to avoid SQL injection
cursor.execute("SELECT column_name FROM information_schema.columns WHERE table_name = %s", (table,))

for (header,) in cursor:
    df = spark.read.jdbc('jdbc:postgresql://%s:5432/%s' % (host, database),
                         table=table, properties=properties).select(header).limit(10)
    # do stuff with the DataFrame containing this column's contents here,
    # before continuing to the next column and loading that into memory
    df.show()

Actual behavior:

An out-of-memory exception occurs. I presume this is because Spark attempts to load the entire table and then select a column, rather than loading only the selected column. Or is it actually loading just the column, and that column is too large? I limited the result to just 10 values, so that shouldn't be the case.

2018-09-04 19:42:11 ERROR Utils:91 - uncaught error in thread spark-listener-group-appStatus, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded

1 Answer


Instead of a table name, you can pass a SQL query that selects only the one column as the "table" parameter of jdbc. Please find some details here:

spark, scala & jdbc - how to limit number of records
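In other words, the column selection can be pushed down to PostgreSQL by wrapping it in a parenthesized subquery with an alias and passing that as the table argument. A minimal sketch, reusing the question's variable names (host, database, table, header, and properties are assumed to be defined as in the question; the alias tmp is arbitrary):

```python
# Build a pushdown query: the JDBC source requires a parenthesized
# subquery with an alias so it can be used as a derived table.
def pushdown_query(column, table):
    return "(SELECT %s FROM %s) AS tmp" % (column, table)

# Only the selected column travels over JDBC; the full table is never loaded:
# df = spark.read.jdbc(
#     'jdbc:postgresql://%s:5432/%s' % (host, database),
#     table=pushdown_query(header, table),
#     properties=properties,
# ).limit(10)
```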

Sign up to request clarification or add additional context in comments.

3 Comments

I replaced the table argument with a query string: 'SELECT %s FROM %s' % (header, table). However, I get an error with the SELECT keyword from the postgresql driver. I noticed I can rewrite the read as spark.read.format('jdbc') and pass in the query string through the dbtable option, but is there a way to do this with the spark.read.jdbc function?
Parentheses are important in such queries; please see: docs.databricks.com/spark/latest/data-sources/…
Thanks! The query string '(SELECT %s FROM %s) %s' % (header, table, header) did the trick.
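For reference, the DataFrameReader form mentioned in the first comment takes the same parenthesized subquery through the dbtable option. A hedged sketch (the connection details and names below are placeholders, not tested against a live database):

```python
# Same pushdown via spark.read.format('jdbc'); the subquery string is
# what matters — parentheses and an alias are required by the JDBC source.
header, table = "my_column", "my_table"  # placeholder names
dbtable = "(SELECT %s FROM %s) %s" % (header, table, header)

# df = (spark.read.format('jdbc')
#       .option('url', 'jdbc:postgresql://localhost:5432/mydb')
#       .option('dbtable', dbtable)
#       .option('user', 'user')
#       .option('password', 'password')
#       .load())
```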
