
Question

Is there a way to load a specific column from a (PostgreSQL) database table as a Spark DataFrame?

Below is what I've tried.

Expected behavior:

The code below should result in only the specified column being stored in memory, not the entire table (table is too large for my cluster).

# psycopg2 (imported as p2) fetches the column names; Spark reads the data
import psycopg2 as p2

# make connection in order to get column names
conn = p2.connect(database=database, user=user, password=password, host=host, port="5432")
cursor = conn.cursor()
# parameterized query instead of string formatting, to avoid SQL injection
cursor.execute("SELECT column_name FROM information_schema.columns WHERE table_name = %s", (table,))

for (header,) in cursor:
    df = spark.read.jdbc('jdbc:postgresql://%s:5432/%s' % (host, database),
                         table=table, properties=properties).select(header).limit(10)
    # do stuff with the DataFrame containing this column's contents here,
    # before continuing to the next column and loading that into memory
    df.show()

Actual behavior:

An out-of-memory exception occurs. I presume this is because Spark attempts to load the entire table and then select a column, rather than loading only the selected column. Or is it actually loading just the column, and that column is too large? I limited the result to just 10 values, so that shouldn't be the case.

2018-09-04 19:42:11 ERROR Utils:91 - uncaught error in thread spark-listener-group-appStatus, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded

1 Answer


Instead of a table name, you can pass a SQL query that selects only the one column as the "table" parameter of jdbc. Please find some details here:

spark, scala & jdbc - how to limit number of records
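In other words, the column selection can be pushed down to PostgreSQL by wrapping it in a parenthesized subquery with an alias and passing that as the table argument. A minimal sketch, reusing the question's variable names (host, database, table, header, and properties are assumed to be defined as in the question; the alias tmp is arbitrary):

```python
# Build a pushdown query: the JDBC source requires a parenthesized
# subquery with an alias so it can be used as a derived table.
def pushdown_query(column, table):
    return "(SELECT %s FROM %s) AS tmp" % (column, table)

# Only the selected column travels over JDBC; the full table is never loaded:
# df = spark.read.jdbc(
#     'jdbc:postgresql://%s:5432/%s' % (host, database),
#     table=pushdown_query(header, table),
#     properties=properties,
# ).limit(10)
```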

Sign up to request clarification or add additional context in comments.

3 Comments

I replaced the table argument with a query string: 'SELECT %s FROM %s' % (header, table). However, I get an error with the SELECT keyword from the postgresql driver. I noticed I can rewrite the read as spark.read.format('jdbc') and pass in the query string through the dbtable option, but is there a way to do this with the spark.read.jdbc function?
Parentheses are important in such queries; please see: docs.databricks.com/spark/latest/data-sources/…
Thanks! The query string '(SELECT %s FROM %s) %s' % (header, table, header) did the trick.
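For reference, the DataFrameReader form mentioned in the first comment takes the same parenthesized subquery through the dbtable option. A hedged sketch (the connection details and names below are placeholders, not tested against a live database):

```python
# Same pushdown via spark.read.format('jdbc'); the subquery string is
# what matters — parentheses and an alias are required by the JDBC source.
header, table = "my_column", "my_table"  # placeholder names
dbtable = "(SELECT %s FROM %s) %s" % (header, table, header)

# df = (spark.read.format('jdbc')
#       .option('url', 'jdbc:postgresql://localhost:5432/mydb')
#       .option('dbtable', dbtable)
#       .option('user', 'user')
#       .option('password', 'password')
#       .load())
```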
