From the official documentation, we can see that it loads the table into a Spark DataFrame first and then performs the query with .sql().

words = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data:samples.shakespeare') \
  .load()
words.createOrReplaceTempView('words')

# Perform word count.
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
word_count.printSchema()

Can I do something similar by loading only the query result instead of the whole table? Here is my code that loads a BigQuery result into a Pandas DataFrame.

import google.auth
from google.cloud import bigquery

sql_query = 'Some Queries'
credentials, project = google.auth.default(scopes=[
    'https://www.googleapis.com/auth/drive',
    'https://www.googleapis.com/auth/bigquery',
])
client = bigquery.Client(credentials=credentials, project=project)
df = client.query(sql_query).to_dataframe()

I know that a Pandas DataFrame can be converted to a Spark DataFrame, as below, but I am looking for a cleaner and faster way.
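For reference, the conversion I mean is just this (a minimal sketch, assuming an active SparkSession named spark):

# Convert the Pandas DataFrame from the BigQuery client into a Spark DataFrame.
# This funnels all the data through the driver, which is what I want to avoid.
spark_df = spark.createDataFrame(df)
spark_df.createOrReplaceTempView('query_result')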

1 Answer

The spark-bigquery-connector relies on the BigQuery Storage API, which reads directly from the table's underlying files and allows the read to be distributed across executors. The BigQuery client, by contrast, reads the entire result in a single thread.

However, you can use the view support added to the connector in version 0.10.0-beta: first create a view from your SQL query, then read the view directly into a DataFrame, as sketched below.
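A minimal sketch of that approach, using the connector's viewsEnabled and materializationDataset options; the project, dataset, and view names here are placeholders:

# Materialize the query as a view using the BigQuery client from the question.
client.query(
    'CREATE OR REPLACE VIEW `my_project.my_dataset.my_view` AS ' + sql_query
).result()

# Read the view through the connector. viewsEnabled must be set, and the
# connector needs a dataset in which it can materialize the view's results.
df = spark.read.format('bigquery') \
    .option('viewsEnabled', 'true') \
    .option('materializationDataset', 'my_dataset') \
    .option('table', 'my_project.my_dataset.my_view') \
    .load()

Because the view's results are materialized and then read via the Storage API, the read is distributed rather than pulled through a single thread.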
