From the official documentation, we can see that it loads the table into a Spark DataFrame first and then performs the query with .sql().

words = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data:samples.shakespeare') \
  .load()
words.createOrReplaceTempView('words')

# Perform word count.
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
word_count.printSchema()

Can I do something similar by loading only the query result instead of the whole table? Here is my code that loads a BigQuery result into a Pandas DataFrame.

import google.auth
from google.cloud import bigquery

sql_query = 'Some Queries'
credentials, project = google.auth.default(scopes=[
    'https://www.googleapis.com/auth/drive',
    'https://www.googleapis.com/auth/bigquery',
])
client = bigquery.Client(credentials=credentials, project=project)
df = client.query(sql_query).to_dataframe()

I know that a Pandas DataFrame can be converted to a Spark DataFrame, as below, but I am looking for a cleaner and faster way.
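For reference, the conversion I mean is just this (a minimal sketch, assuming an active SparkSession named spark):

# Convert the Pandas DataFrame from the BigQuery client into a Spark DataFrame.
# This funnels all the data through the driver, which is what I want to avoid.
spark_df = spark.createDataFrame(df)
spark_df.createOrReplaceTempView('query_result')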

1 Answer

The spark-bigquery-connector relies on the BigQuery Storage API, which reads directly from the table's underlying files and allows the read to be distributed across executors. The BigQuery client, by contrast, reads the entire result in a single thread.

However, you can use the view support added to the connector in version 0.10.0-beta: first create a view from your SQL query, then read the view directly into a DataFrame, as sketched below.
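A minimal sketch of that approach, using the connector's viewsEnabled and materializationDataset options; the project, dataset, and view names here are placeholders:

# Materialize the query as a view using the BigQuery client from the question.
client.query(
    'CREATE OR REPLACE VIEW `my_project.my_dataset.my_view` AS ' + sql_query
).result()

# Read the view through the connector. viewsEnabled must be set, and the
# connector needs a dataset in which it can materialize the view's results.
df = spark.read.format('bigquery') \
    .option('viewsEnabled', 'true') \
    .option('materializationDataset', 'my_dataset') \
    .option('table', 'my_project.my_dataset.my_view') \
    .load()

Because the view's results are materialized and then read via the Storage API, the read is distributed rather than pulled through a single thread.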
