from google.cloud import bigquery as bq
import google_auth_oauthlib.flow
import pandas as pd

query = '''select ... from ...'''

bigquery_client = bq.Client()
table = bq.query.QueryResults(query=query, client=bigquery_client)
table.use_legacy_sql = False
table.run()

# transfer the BigQuery results into a pandas DataFrame
columns = [field.name for field in table.schema]
rows = table.fetch_data()
data = []
for row in rows:
    data.append(row)

df = pd.DataFrame(data=data[0], columns=columns)

I want to load more than 10 million rows into Python. This worked fine a few weeks ago, but now it only returns 100,000 rows. Does anyone know a reliable way to do this?

  • I also tried async_query.py and played with rows = query_job.results().fetch_data(max_results=1000000), but it seems like a 100,000-row cap is applied somewhere. Is there a way to override the cap, or a more efficient way to get BigQuery results into Python for computation? Commented Aug 15, 2017 at 17:21
  • just wondering, have you run this query in the web UI or the CLI to see if it returns the total number of rows you expect? Commented Aug 15, 2017 at 17:53
  • I have run it in my CLI, and it also returns only 100,000 rows. So the cutoff could be at either table.run() or table.fetch_data(). Commented Aug 15, 2017 at 17:55
  • if the CLI is also returning 100k, then it seems that's actually all you have in your table. Looks like the issue is in your table and not some threshold being hit when bringing in the data. Commented Aug 15, 2017 at 18:11
  • I ran the same query in the UI and it returns more than 39 million rows. With the Python program it's harder to diagnose where the cutoff occurs (see the sketch right after these comments). Commented Aug 15, 2017 at 18:14
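
One way to narrow down where the cutoff happens is sketched below. It assumes, as the use of data[0] in the question's code suggests, that this older client's fetch_data() returns a (rows, total_rows, page_token) tuple rather than a plain row iterator; the idea is to compare the rows actually fetched against the total row count BigQuery reports.

rows, total_rows, page_token = table.fetch_data()
print(len(rows))      # rows in the first fetched page (here apparently 100,000)
print(total_rows)     # total rows the query produced server-side
print(page_token)     # a non-empty token means more pages were left unfetched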

1 Answer


I just tested this code here and could bring in 3 million rows with no cap being applied:

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/key.json'

from google.cloud.bigquery import Client

bc = Client()
query = 'your query'

job = bc.run_sync_query(query)
job.use_legacy_sql = False
job.run()

data = list(job.fetch_data())

Does it work for you?
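
If it does, here is a minimal follow-up sketch, assuming google-cloud-bigquery 0.26.0 as above, for getting those rows into a pandas DataFrame; it reuses the job object and mirrors the schema-to-columns approach from the question:

import pandas as pd

rows = list(job.fetch_data())                   # rows are fetched page by page behind the scenes
columns = [field.name for field in job.schema]  # column names from the result schema
df = pd.DataFrame(rows, columns=columns)
print(df.shape)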


9 Comments

I can run your code without a problem, but data[0] is still a list of 100K tuples, data[1] is the row count (39 million), and data[2] is a string. Is that what your data structure looks like too?
Ah, I see. It looks like you are using an old version of the BigQuery client. I recommend using version 0.26.0. You can see which version you have by running: from google.cloud.bigquery import __version__; print(__version__)
Yes, you are right, the version was probably downgraded by other installations. Now it takes a long time to load the table. I am in the process of working out an efficient workflow for dealing with a lot of data. Do you have any suggestions?
Yeah, bringing 40 million rows into a single instance is quite expensive. It really depends on what you want to do. What I recommend is either Dataflow (implemented with Apache Beam) or a cluster to run your analyses on, such as Dataproc (a rough Beam outline follows these comments). For the latter I have Jupyter integrated with the cluster's master node and find it really useful for everyday data analysis.
Could you please give me more implementation detail on your Jupyter workflow? Do you use Datalab? For the big data interaction, do you set up a cluster and bring the BigQuery data directly into memory? Any documentation links would be appreciated!
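
A rough sketch of the Dataflow/Beam suggestion from the comments above. The project id, bucket paths and query are placeholders, and the BigQuerySource-based read reflects the Beam Python API of that era, so treat it as an outline rather than a drop-in pipeline:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',               # or 'DirectRunner' for local testing
    project='your-project',                # placeholder project id
    temp_location='gs://your-bucket/tmp',  # placeholder GCS staging path
)

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromBigQuery' >> beam.io.Read(
           beam.io.BigQuerySource(query='SELECT ... FROM ...',
                                  use_standard_sql=True))
     | 'Process' >> beam.Map(lambda row: row)            # per-row computation goes here
     | 'WriteResults' >> beam.io.WriteToText('gs://your-bucket/output'))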
