
I have a very big gzip-compressed file of JSON data. Due to some limitations, I am not able to extract and transform the data beforehand. The JSON data itself is very dynamic in nature.

For example:

{"name": "yourname", "age": "your age", "schooling": {"high-school-name1": "span of years studied"}}
{"name": "yourname", "age": "your age", "schooling": {"high-school-name2": "span of years studied"}}

The problem is that the high-school-name key is dynamic; it will be different for different sets of users.

Now, when loading this into BigQuery, I am not able to determine which type I should specify for the schooling field, or how to handle this load at all.
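
For context, the shape I would ideally end up with is something like the following, where every school becomes an element of a repeated record instead of its own column. This is only a sketch using the google-cloud-bigquery Python client; the field names are illustrative, not anything prescribed by BigQuery:

from google.cloud import bigquery

# Target schema: the dynamic high-school-name keys are folded into a
# repeated RECORD of {school, years}, so the schema never has to change.
target_schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("age", "STRING"),
    bigquery.SchemaField(
        "schooling",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("school", "STRING"),  # e.g. "high-school-name1"
            bigquery.SchemaField("years", "STRING"),   # e.g. "span of years studied"
        ],
    ),
]

Getting the data into that shape is exactly the transformation step I cannot do inside the function.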

I am using a Cloud Function to automate the flow, so as soon as the file is uploaded to Cloud Storage it triggers the function. Since the Cloud Function has very little memory, there is no way to transform the data there. I have looked into Dataprep, but I am trying to understand whether I am missing something that would make what I am trying to do possible without using any other services.
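
For illustration, here is a minimal sketch of the kind of trigger described above: a background Cloud Function that only submits a BigQuery load job (with schema auto-detect) and lets BigQuery read the file itself, so the gzipped data never passes through the function's memory. The table name is a placeholder, and auto-detect would still turn every distinct high-school-name key into its own column, which is the problem.

from google.cloud import bigquery

BQ_TABLE = "my-project.my_dataset.users_raw"  # placeholder destination table

def gcs_to_bigquery(event, context):
    """Background function triggered by google.storage.object.finalize."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # only works well if the keys are stable, which they are not here
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # The load job runs inside BigQuery; the function only submits it and waits,
    # so the function's memory limit does not matter for the file size.
    client.load_table_from_uri(uri, BQ_TABLE, job_config=job_config).result()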

  • Did you try to upload the file as is into BigQuery with schema auto-detect? If it fails, do you know how many different values you have for high-school-name? Commented Jul 8, 2020 at 15:53
  • It's virtually impossible to tell how many we will have. As you can understand, we cannot capture all the school names, and even if we did, the list would keep growing over time. Commented Jul 9, 2020 at 8:05
  • Do you want to standardize the column name of the dynamic field high-school-nameXXX, i.e. to have a single column name (high-school-name) with span of years studied as its value? Which result do you expect? Can you update your question with this detail? Commented Jul 9, 2020 at 10:22

1 Answer


According to the documentation on Loading JSON data from Cloud Storage and Specifying nested and repeated columns, I think you do indeed need a processing step, which could be well covered with either Dataproc or Dataflow.

You can implement a pipeline to transform your dynamic data as needed and write it to BigQuery. This doc might be of interest to you: there is a template with source code that shows how to put JSON into a BigQuery table. Here is the documentation about loading JSON data from Cloud Storage.
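
As an illustration of what such a pipeline could look like, here is a minimal Apache Beam (Dataflow) sketch, assuming one JSON object per line in the .gz file as in your example; the bucket, table, and field names are mine, not taken from your question. It folds the dynamic schooling keys into a repeated record before writing to BigQuery.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

TABLE = "my-project:my_dataset.users"  # placeholder destination table

# Target schema: the dynamic keys become rows of a repeated RECORD.
SCHEMA = {
    "fields": [
        {"name": "name", "type": "STRING"},
        {"name": "age", "type": "STRING"},
        {
            "name": "schooling",
            "type": "RECORD",
            "mode": "REPEATED",
            "fields": [
                {"name": "school", "type": "STRING"},
                {"name": "years", "type": "STRING"},
            ],
        },
    ]
}

def normalize(line):
    """Parse one JSON line and fold the dynamic schooling keys into a list."""
    record = json.loads(line)
    schooling = record.get("schooling") or {}
    record["schooling"] = [
        {"school": school, "years": years} for school, years in schooling.items()
    ]
    return record

def run():
    options = PipelineOptions()  # pass --runner=DataflowRunner etc. on the command line
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # ReadFromText decompresses .gz files automatically based on the extension.
            | "Read gzipped JSON" >> beam.io.ReadFromText("gs://my-bucket/users.json.gz")
            | "Normalize schooling" >> beam.Map(normalize)
            | "Write to BigQuery" >> beam.io.WriteToBigQuery(
                TABLE,
                schema=SCHEMA,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

if __name__ == "__main__":
    run()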

Please note that one of the limitations is:

If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.

This is one of the reasons why I think you have to implement your solution with an additional product, as you mentioned.


1 Comment

Joss, the ideas mentioned in your answer are pretty good, but for me, importing the JSON as CSV into BigQuery and then running another job to flatten it worked.
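
For completeness, a minimal sketch of that workaround as I understand it, with placeholder bucket and table names: step 1 loads every gzipped line into a single STRING column by treating the file as CSV with a delimiter that is assumed never to appear in the data, and step 2 flattens the raw JSON with a query, using a JavaScript UDF so the unknown schooling keys do not have to be listed in advance.

from google.cloud import bigquery

client = bigquery.Client()

# Step 1: load the raw lines into a one-column staging table.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\u00fe",   # assumed not to appear anywhere in the data
    quote_character="",          # JSON lines contain double quotes, so disable quoting
    schema=[bigquery.SchemaField("raw", "STRING")],
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.load_table_from_uri(
    "gs://my-bucket/users.json.gz",
    "my-project.my_dataset.users_raw_lines",
    job_config=load_config,
).result()

# Step 2: flatten the raw JSON, turning the dynamic schooling keys into a repeated record.
flatten_sql = """
CREATE TEMP FUNCTION schooling_pairs(j STRING)
RETURNS ARRAY<STRUCT<school STRING, years STRING>>
LANGUAGE js AS '''
  const obj = JSON.parse(j || "{}");
  return Object.keys(obj).map(k => ({school: k, years: String(obj[k])}));
''';
CREATE OR REPLACE TABLE `my-project.my_dataset.users` AS
SELECT
  JSON_EXTRACT_SCALAR(raw, '$.name') AS name,
  JSON_EXTRACT_SCALAR(raw, '$.age')  AS age,
  schooling_pairs(JSON_EXTRACT(raw, '$.schooling')) AS schooling
FROM `my-project.my_dataset.users_raw_lines`
"""
client.query(flatten_sql).result()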
