from google.cloud import bigquery as bq
import google_auth_oauthlib.flow
import pandas as pd

query = '''select ... from ...'''

bigquery_client = bq.Client()
table = bq.query.QueryResults(query=query, client=bigquery_client)
table.use_legacy_sql = False
table.run()

# transfer the BigQuery results into a pandas DataFrame
columns = [field.name for field in table.schema]
rows = table.fetch_data()
data = []
for row in rows:
    data.append(row)

df = pd.DataFrame(data=data[0], columns=columns)

I want to load more than 10 million rows into Python. This worked fine a few weeks ago, but now it only returns 100,000 rows. Does anyone know a reliable way to do this?

  • I also tried async_query.py and played with rows = query_job.results().fetch_data(max_results=1000000), but it seems like a 100,000-row cap is applied somewhere. Is there a way to override the cap, or a more efficient way to get BigQuery results into Python for computation? Commented Aug 15, 2017 at 17:21
  • just wondering, have you run this query in the web UI or the CLI to see if it returns the total number of rows you expect? Commented Aug 15, 2017 at 17:53
  • I have run it in my CLI, and it also returns only 100,000 rows. So the cutoff could be at either table.run() or table.fetch_data(). Commented Aug 15, 2017 at 17:55
  • if the CLI is also returning 100k, then it seems that's actually all you have in your table. Looks like the issue is in your table and not some threshold being hit when bringing in the data. Commented Aug 15, 2017 at 18:11
  • I ran the same query in the UI and it returns more than 39 million rows. With the Python program it's harder to diagnose where the cutoff occurs (see the sketch right after these comments). Commented Aug 15, 2017 at 18:14
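
One way to narrow down where the cutoff happens is sketched below. It assumes, as the use of data[0] in the question's code suggests, that this older client's fetch_data() returns a (rows, total_rows, page_token) tuple rather than a plain row iterator; the idea is to compare the rows actually fetched against the total row count BigQuery reports.

rows, total_rows, page_token = table.fetch_data()
print(len(rows))      # rows in the first fetched page (here apparently 100,000)
print(total_rows)     # total rows the query produced server-side
print(page_token)     # a non-empty token means more pages were left unfetched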

1 Answer


I just tested this code here and could bring in 3 million rows with no cap being applied:

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/key.json'

from google.cloud.bigquery import Client

bc = Client()
query = 'your query'

job = bc.run_sync_query(query)
job.use_legacy_sql = False
job.run()

data = list(job.fetch_data())

Does it work for you?
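
If it does, here is a minimal follow-up sketch, assuming google-cloud-bigquery 0.26.0 as above, for getting those rows into a pandas DataFrame; it reuses the job object and mirrors the schema-to-columns approach from the question:

import pandas as pd

rows = list(job.fetch_data())                   # rows are fetched page by page behind the scenes
columns = [field.name for field in job.schema]  # column names from the result schema
df = pd.DataFrame(rows, columns=columns)
print(df.shape)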


9 Comments

I can run your code without a problem, but data[0] is still a list of 100K tuples, data[1] is the row count (39 million), and data[2] is a string. Is that what your data structure looks like too?
Ah, I see. It looks like you are using an old version of the BigQuery client. I recommend using version 0.26.0. You can see which version you have by running: from google.cloud.bigquery import __version__; print(__version__)
Yes, you are right, the version was probably downgraded by other installations. Now it takes a long time to load the table. I am in the process of working out an efficient workflow for dealing with a lot of data. Do you have any suggestions?
Yeah, bringing 40 million rows into a single instance is quite expensive. It really depends on what you want to do. What I recommend is either Dataflow (implemented with Apache Beam) or a cluster to run your analyses on, such as Dataproc (a rough Beam outline follows these comments). For the latter I have Jupyter integrated with the cluster's master node and find it really useful for everyday data analysis.
Could you please give me more implementation detail on your Jupyter workflow? Do you use Datalab? For the big data interaction, do you set up a cluster and bring the BigQuery data directly into memory? Any documentation links would be appreciated!
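
A rough sketch of the Dataflow/Beam suggestion from the comments above. The project id, bucket paths and query are placeholders, and the BigQuerySource-based read reflects the Beam Python API of that era, so treat it as an outline rather than a drop-in pipeline:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',               # or 'DirectRunner' for local testing
    project='your-project',                # placeholder project id
    temp_location='gs://your-bucket/tmp',  # placeholder GCS staging path
)

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromBigQuery' >> beam.io.Read(
           beam.io.BigQuerySource(query='SELECT ... FROM ...',
                                  use_standard_sql=True))
     | 'Process' >> beam.Map(lambda row: row)            # per-row computation goes here
     | 'WriteResults' >> beam.io.WriteToText('gs://your-bucket/output'))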
