
I have a very big gzip-compressed file of JSON data. Due to some limitations, I am not able to extract and transform the data beforehand. The JSON data itself is very dynamic in nature.

For example:

{"name": "yourname", "age": "your age", "schooling": {"high-school-name1": "span of years studied"}}
{"name": "yourname", "age": "your age", "schooling": {"high-school-name2": "span of years studied"}}

The problem is that the high-school-name key is dynamic; it will be different for different sets of users.

Now, when loading this into BigQuery, I am not able to determine which type I should specify for the schooling field, or how to handle this load at all.
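
For context, the shape I would ideally end up with is something like the following, where every school becomes an element of a repeated record instead of its own column. This is only a sketch using the google-cloud-bigquery Python client; the field names are illustrative, not anything prescribed by BigQuery:

from google.cloud import bigquery

# Target schema: the dynamic high-school-name keys are folded into a
# repeated RECORD of {school, years}, so the schema never has to change.
target_schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("age", "STRING"),
    bigquery.SchemaField(
        "schooling",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("school", "STRING"),  # e.g. "high-school-name1"
            bigquery.SchemaField("years", "STRING"),   # e.g. "span of years studied"
        ],
    ),
]

Getting the data into that shape is exactly the transformation step I cannot do inside the function.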

I am using a Cloud Function to automate the flow, so as soon as the file is uploaded to Cloud Storage it triggers the function. Since the Cloud Function has very little memory, there is no way to transform the data there. I have looked into Dataprep, but I am trying to understand whether I am missing something that would make what I am trying to do possible without using any other services.
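
For illustration, here is a minimal sketch of the kind of trigger described above: a background Cloud Function that only submits a BigQuery load job (with schema auto-detect) and lets BigQuery read the file itself, so the gzipped data never passes through the function's memory. The table name is a placeholder, and auto-detect would still turn every distinct high-school-name key into its own column, which is the problem.

from google.cloud import bigquery

BQ_TABLE = "my-project.my_dataset.users_raw"  # placeholder destination table

def gcs_to_bigquery(event, context):
    """Background function triggered by google.storage.object.finalize."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # only works well if the keys are stable, which they are not here
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # The load job runs inside BigQuery; the function only submits it and waits,
    # so the function's memory limit does not matter for the file size.
    client.load_table_from_uri(uri, BQ_TABLE, job_config=job_config).result()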

  • Did you try to upload the file as is into BigQuery with schema auto-detect? If it fails, do you know how many different values you have for high-school-name? Commented Jul 8, 2020 at 15:53
  • It's virtually impossible to tell how many we will have. As you can understand, we cannot capture all the school names, and even if we did, the list would keep growing over time. Commented Jul 9, 2020 at 8:05
  • Do you want to standardize the column name of the dynamic field high-school-nameXXX, i.e. to have a single column name (high-school-name) with span of years studied as its value? Which result do you expect? Can you update your question with this detail? Commented Jul 9, 2020 at 10:22

1 Answer


According to the documentation on Loading JSON data from Cloud Storage and Specifying nested and repeated columns, I think you do indeed need a processing step, which could be well covered with either Dataproc or Dataflow.

You can implement a pipeline to transform your dynamic data as needed and write it to BigQuery. This doc might be of interest to you: there is a template with source code that shows how to put JSON into a BigQuery table. Here is the documentation about loading JSON data from Cloud Storage.
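
As an illustration of what such a pipeline could look like, here is a minimal Apache Beam (Dataflow) sketch, assuming one JSON object per line in the .gz file as in your example; the bucket, table, and field names are mine, not taken from your question. It folds the dynamic schooling keys into a repeated record before writing to BigQuery.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

TABLE = "my-project:my_dataset.users"  # placeholder destination table

# Target schema: the dynamic keys become rows of a repeated RECORD.
SCHEMA = {
    "fields": [
        {"name": "name", "type": "STRING"},
        {"name": "age", "type": "STRING"},
        {
            "name": "schooling",
            "type": "RECORD",
            "mode": "REPEATED",
            "fields": [
                {"name": "school", "type": "STRING"},
                {"name": "years", "type": "STRING"},
            ],
        },
    ]
}

def normalize(line):
    """Parse one JSON line and fold the dynamic schooling keys into a list."""
    record = json.loads(line)
    schooling = record.get("schooling") or {}
    record["schooling"] = [
        {"school": school, "years": years} for school, years in schooling.items()
    ]
    return record

def run():
    options = PipelineOptions()  # pass --runner=DataflowRunner etc. on the command line
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # ReadFromText decompresses .gz files automatically based on the extension.
            | "Read gzipped JSON" >> beam.io.ReadFromText("gs://my-bucket/users.json.gz")
            | "Normalize schooling" >> beam.Map(normalize)
            | "Write to BigQuery" >> beam.io.WriteToBigQuery(
                TABLE,
                schema=SCHEMA,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

if __name__ == "__main__":
    run()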

Please note that one of the limitations is:

If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.

This is one of the reasons why I think you have to implement your solution with an additional product, as you mentioned.


1 Comment

Joss, the ideas mentioned in your answer are pretty good, but for me, importing the JSON as CSV into BigQuery and then running another job to flatten it worked.
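
For completeness, a minimal sketch of that workaround as I understand it, with placeholder bucket and table names: step 1 loads every gzipped line into a single STRING column by treating the file as CSV with a delimiter that is assumed never to appear in the data, and step 2 flattens the raw JSON with a query, using a JavaScript UDF so the unknown schooling keys do not have to be listed in advance.

from google.cloud import bigquery

client = bigquery.Client()

# Step 1: load the raw lines into a one-column staging table.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\u00fe",   # assumed not to appear anywhere in the data
    quote_character="",          # JSON lines contain double quotes, so disable quoting
    schema=[bigquery.SchemaField("raw", "STRING")],
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.load_table_from_uri(
    "gs://my-bucket/users.json.gz",
    "my-project.my_dataset.users_raw_lines",
    job_config=load_config,
).result()

# Step 2: flatten the raw JSON, turning the dynamic schooling keys into a repeated record.
flatten_sql = """
CREATE TEMP FUNCTION schooling_pairs(j STRING)
RETURNS ARRAY<STRUCT<school STRING, years STRING>>
LANGUAGE js AS '''
  const obj = JSON.parse(j || "{}");
  return Object.keys(obj).map(k => ({school: k, years: String(obj[k])}));
''';
CREATE OR REPLACE TABLE `my-project.my_dataset.users` AS
SELECT
  JSON_EXTRACT_SCALAR(raw, '$.name') AS name,
  JSON_EXTRACT_SCALAR(raw, '$.age')  AS age,
  schooling_pairs(JSON_EXTRACT(raw, '$.schooling')) AS schooling
FROM `my-project.my_dataset.users_raw_lines`
"""
client.query(flatten_sql).result()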
