How to import CSV to BigQuery using columns names from first row?

Question

I currently have an app written in Apps Script that imports some CSV files from Cloud Storage into BigQuery. While this is pretty simple, I am forced to specify the schema for the destination table.

What I am looking for is a way to read the CSV file and create the schema based on the column names in the first row. It is okay if all the column types end up as strings. I feel like this is a pretty common scenario. Does anyone have any guidance on this?

Much thanks, Nick

It's been more than three years since this question was asked. Is there any direct BigQuery API method available now to set the schema from an external source, or to load the CSV without a schema?
– Vijin Paulraj, Commented Aug 18, 2017 at 1:05
OMG, to think that BigQuery offers Auto-Detect but can't even infer, or ask, whether the first line holds the column names... As of August 2023, such a basic feature is still missing...
– Fabien Haddadi, Commented Aug 10, 2023 at 10:03

Jordan Tigani · Accepted Answer · 2014-02-19 00:11:46Z

5

+50

One option (not a particularly pleasant one, but an option) would be to make a raw HTTP request from Apps Script to GCS to read the first row of the data, split it on commas, and generate a schema from that. GCS doesn't have Apps Script integration, so you need to build the requests by hand. Apps Script does have some utilities to let you do this (as well as OAuth), but my guess is that it is going to be a decent amount of work to get right.
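The schema-generation step can be sketched as follows, in Python rather than Apps Script and leaving out the GCS request itself. The function name `schema_from_header` and the renaming rules (replace anything outside letters, digits, and underscores; prefix names that start with a digit; suffix duplicates) are my own assumptions, not anything BigQuery does for you. The output uses the `fields` / `name` / `type` / `mode` layout of a BigQuery table schema.

```python
import csv
import io
import re


def schema_from_header(header_line, delimiter=","):
    """Build an all-STRING schema dict from the first line of a CSV file."""
    # csv.reader copes with quoted column names that contain the delimiter.
    names = next(csv.reader(io.StringIO(header_line), delimiter=delimiter))
    fields = []
    seen = set()
    for index, raw in enumerate(names):
        # Assumed conservative rule: only letters, digits, and underscores.
        name = re.sub(r"[^A-Za-z0-9_]", "_", raw.strip())
        # Empty names and names starting with a digit get a positional prefix.
        if not name or name[0].isdigit():
            name = f"col_{index}_{name}".rstrip("_")
        # Column names are treated as case-insensitive, so dedupe on lowercase.
        base, suffix = name, 1
        while name.lower() in seen:
            suffix += 1
            name = f"{base}_{suffix}"
        seen.add(name.lower())
        fields.append({"name": name, "type": "STRING", "mode": "NULLABLE"})
    return {"fields": fields}


if __name__ == "__main__":
    print(schema_from_header('id,first name,"a,b",id'))
```

The resulting dict could then be passed as the `schema` of the load job, together with skipping the header row.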

There are also a couple of things you could try from the BigQuery side. You could import the data to a temporary table as a single field (set the field delimiter to something that doesn't exist in the data, like '\r'). You can then read the header row via tabledata.list() (i.e. the first row of the temporary table), run a query that splits the single field up into columns with a regular expression, and set allow_large_results and a destination table.
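A rough sketch of generating that query is below. Note that it swaps the regular expression for `SPLIT(...)[SAFE_OFFSET(i)]` and targets standard SQL, so it will not handle quoted fields that contain the delimiter. The helper names, the table name, and the `line` column are hypothetical, and the column names are assumed to already be valid identifiers.

```python
def sql_string(value):
    """Quote a Python string as a SQL string literal with backslash escapes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_split_query(table, line_column, header_line, column_names, delimiter=","):
    """Return a query that splits one STRING column into named columns."""
    selects = ",\n".join(
        f"  SPLIT({line_column}, {sql_string(delimiter)})[SAFE_OFFSET({i})] AS {name}"
        for i, name in enumerate(column_names)
    )
    return "\n".join([
        "SELECT",
        selects,
        f"FROM `{table}`",
        # Drop the header row by comparing against the raw header line.
        f"WHERE {line_column} != {sql_string(header_line)}",
    ])


if __name__ == "__main__":
    print(build_split_query("my_dataset.raw_lines", "line", "id,name", ["id", "name"]))
```

Filtering on the header text, rather than on row position, sidesteps the question of whether the header is still the first row after loading.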

One other option would be to use a dummy schema with more columns than you'll ever have, then use the allow_jagged_rows option to allow rows that are missing data at the end of the row. You can then read the first row (similar to the previous option) with tabledata.list() and figure out how many columns are actually present. Then you could generate a query that rewrites the table with the correct column names. The advantage of this approach is that you don't need regular expressions or parsing; it lets BigQuery do all of the CSV parsing.
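Here is a minimal sketch of the dummy-schema idea, under a few assumptions: the placeholder columns are named `col_0`, `col_1`, ... (my choice), the first row comes back as a list with `None` for the unused trailing columns, and the header values are already valid column names. The generated query is standard SQL and both function names are hypothetical.

```python
def dummy_schema(max_columns):
    """Schema with more STRING columns than any file is expected to have."""
    return {
        "fields": [
            {"name": f"col_{i}", "type": "STRING", "mode": "NULLABLE"}
            for i in range(max_columns)
        ]
    }


def build_rename_query(table, first_row):
    """Return a query that renames col_N to the header values and drops the header row."""
    names = list(first_row)
    # Trailing None values are the dummy columns the file never filled.
    while names and names[-1] is None:
        names.pop()
    if not names:
        raise ValueError("first row has no values")
    selects = ",\n".join(f"  col_{i} AS {name}" for i, name in enumerate(names))
    header_value = names[0].replace("\\", "\\\\").replace("'", "\\'")
    return "\n".join([
        "SELECT",
        selects,
        f"FROM `{table}`",
        # Keep NULLs explicitly, since a bare != comparison would drop them.
        f"WHERE col_0 IS NULL OR col_0 != '{header_value}'",
    ])


if __name__ == "__main__":
    print(build_rename_query("my_dataset.tmp_load", ["id", "name", None, None]))
```

The header filter assumes no data row has the header's first value in its first column; if that can happen, compare more columns.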

There is a downside to both of the latter two approaches, however: the BigQuery load mechanism does not guarantee that it preserves the ordering of your data. In practice, the first row should always be the first row in the table, but that isn't guaranteed to always be true.

Sorry there isn't a better solution. We've had a feature request on the table for a long time to auto-infer schemas; I'll take this as another vote for it.


2 Comments

Aviv Noy Over a year ago

And what if I want to load the whole text file as one big string, into a single row with one big string column?

baxx Over a year ago

Is this answer still relevant in 2019?

William Vambenepe · Accepted Answer · 2016-05-24 07:35:46Z

2

For the record, schema inference is now available: https://cloud.google.com/bigquery/federated-data-sources#auto-detect


Ankit Sahay · Accepted Answer · 2022-06-13 08:36:39Z

2

I was facing the same issue when all my columns were of the String datatype. When I added one more column (any random column) with an integer datatype, it worked. I used the "Auto-detect Schema" option and, under Advanced Options, set "Header rows to skip" to 1.
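For anyone doing this through the API instead of the web UI, the same two settings can be sketched as a load job request body. This is only an illustration: the field names (`sourceUris`, `destinationTable`, `sourceFormat`, `autodetect`, `skipLeadingRows`) follow the BigQuery REST API's load configuration, while the function name and the project, dataset, table, and bucket values are placeholders I made up.

```python
import json


def autodetect_load_job(project_id, dataset_id, table_id, source_uri):
    """Build a load job body that enables schema auto-detection for a CSV file."""
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "sourceFormat": "CSV",
                # Equivalent of ticking "Auto-detect Schema" in the UI.
                "autodetect": True,
                # Equivalent of "Header rows to skip" set to 1.
                "skipLeadingRows": 1,
            }
        }
    }


if __name__ == "__main__":
    job = autodetect_load_job(
        "my-project", "my_dataset", "my_table", "gs://my-bucket/data.csv"
    )
    print(json.dumps(job, indent=2))
```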


Aturen · Accepted Answer · 2017-05-08 21:24:06Z

1

Building off of William Vambenepe's answer, BigQuery can guess at the schema now. The documentation page moved to: https://cloud.google.com/bigquery/docs/schema-detect

Note that your import can still fail, as schema detection only looks at the first 100 rows. This can be problematic if you have a rare "NA" or "Other" in a column of seeming integers.
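One way to catch this before loading is to scan the file yourself for values past the sampled rows that would not parse as integers. The sketch below is a local pre-check, not anything BigQuery provides; the function name is hypothetical, and `sample_rows=100` simply mirrors the number mentioned above.

```python
import csv
import io
import re

INTEGER = re.compile(r"[+-]?[0-9]+")


def find_non_integer_values(csv_text, column, sample_rows=100):
    """Return (row_number, value) pairs after the sample that are not integers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    surprises = []
    for row_number, row in enumerate(reader, start=1):
        if row_number <= sample_rows:
            continue
        value = (row.get(column) or "").strip()
        # Empty cells are skipped here; only non-empty, non-integer text is flagged.
        if value and not INTEGER.fullmatch(value):
            surprises.append((row_number, value))
    return surprises


if __name__ == "__main__":
    text = "n\n1\n2\nNA\n"
    print(find_non_integer_values(text, "n", sample_rows=2))
```

If this turns anything up, declaring that column as STRING in an explicit schema avoids the failed import.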

When this feature first came out, you could go back and change the offending Field Type in the Web UI by hand, because the guesses would auto-populate the schema when you reloaded the failed import. It doesn't seem to do this anymore; hopefully it will return in a future update.
