I'm having an issue with clustered tables in BigQuery (with date partitions). I have a table that is clustered by a column named entity_id. I expect to see a reduction in bytes read when I run queries filtered on this clustered column, but according to the BigQuery web UI it does a full scan anyway.

For example:
SELECT * FROM project.usersDataset.users_cluster WHERE entity_id = '405849241' LIMIT 1000;
Returns: "Query complete (0.570 sec elapsed, 862.94 MB processed)"
This is actually the full table size (862.94 MB).

This is the table configuration: [image: table configuration]

EDIT: I kept testing and found that sometimes some bytes read are saved, but not many:
[image: query from the BigQuery web UI] I was expecting a bigger saving in bytes scanned (the query returned 1 row and scanned 719 MB of the table's 862 MB), but nothing in the BigQuery documentation guarantees this.
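
To put a number on that second run, here is the arithmetic on the two figures reported by the web UI (719 MB scanned, 862.94 MB table size). The variable names are only illustrative:

```python
# Figures reported in the post: bytes scanned by the filtered query
# versus the full table size (both in MB).
scanned_mb = 719.0
table_mb = 862.94

# Fraction of the table that was still read, and the fraction skipped.
scanned_fraction = scanned_mb / table_mb
saved_fraction = 1.0 - scanned_fraction

print(f"scanned: {scanned_fraction:.1%}, saved: {saved_fraction:.1%}")
```

So the filter on entity_id skipped roughly 17% of the table while returning a single row.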

Does anyone have a clue on what could be happening?
Thanks!

  • I asked a similar question here: stackoverflow.com/questions/53980953/…. Can you also provide the same screenshot from your web UI to help get to the bottom of this? Commented Feb 19, 2019 at 18:00
  • Clustering works only on partitioned tables, and generally kicks in on data sets above 1 GB. Commented Feb 19, 2019 at 18:08
  • Yes Tamir, it's quite similar to your problem. Actually (as I noted in my recent edit to the post) I continued testing and found that sometimes some reduction in bytes scanned is achieved (719 MB of the table's 862 MB, with 1 row returned). I suppose I was expecting a bigger cost saving, but nothing in the BigQuery documentation guarantees this, and as Pentium10 points out, maybe the amount of data doesn't help either. Thanks both! Commented Feb 19, 2019 at 19:02
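
One way to picture the remark about table size is a toy model of block pruning: data sorted by the cluster key is stored in blocks, and a filtered query can only skip blocks whose key range cannot contain the value. The sketch below is an assumption-laden illustration, not BigQuery's actual storage format; the `Block` class, the integer keys, the block sizes and the helper names are all made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A storage block holding a contiguous, sorted range of cluster keys."""
    min_key: int
    max_key: int
    size_mb: float


def make_blocks(sorted_keys: List[int], rows_per_block: int, mb_per_row: float) -> List[Block]:
    """Split already-sorted keys into fixed-size blocks (hypothetical layout)."""
    blocks = []
    for start in range(0, len(sorted_keys), rows_per_block):
        chunk = sorted_keys[start:start + rows_per_block]
        blocks.append(Block(chunk[0], chunk[-1], len(chunk) * mb_per_row))
    return blocks


def mb_scanned(blocks: List[Block], key: int) -> float:
    """Sum the size of every block whose key range could contain `key`."""
    return sum(b.size_mb for b in blocks if b.min_key <= key <= b.max_key)


keys = list(range(1000))  # 1000 rows, sorted by a made-up integer key

# Small partition: everything fits in one block, so nothing can be skipped.
one_block = make_blocks(keys, rows_per_block=1000, mb_per_row=1.0)
# Larger layout: the same keys spread over ten blocks.
ten_blocks = make_blocks(keys, rows_per_block=100, mb_per_row=1.0)

scanned_one = mb_scanned(one_block, key=405)
scanned_ten = mb_scanned(ten_blocks, key=405)
print(scanned_one, scanned_ten)
```

In this model the single-block layout reads everything, while the ten-block layout reads one block out of ten. If a small, date-partitioned table ends up with only a few blocks per partition, a filter on the clustered column has little to prune, which would be consistent with the modest saving reported in the question.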

1 Answer

From the BigQuery documentation provided in this link:

Features under development

Support for clustering non-partitioned tables.

Please check that your table is both clustered and partitioned.

Note: per the BigQuery documentation, clustering is also used when the query has no WHERE condition.

3 Comments

  • I forgot to add that it's a table partitioned by date and clustered by the entity_id field. I have recently edited the post to add the table specs. I have a doubt about your note: I thought you actually need the clustered field in the query for this to work.
  • Check Felipe's great article about clustering, medium.com/google-cloud/…. In his example the WHERE clause filters on the partition only, and you can see how clustering still saves cost. That was what I meant, sorry if it wasn't clear. Hope this document will solve your issue.
  • @MarcoLotto, I posted another question on a similar issue, which you can find in this link. Note that if you have a streaming buffer attached to your table, you might need to run a daily merge command to see cost improvements.
