
I have a Google Cloud Storage bucket where a legacy system drops NEW_LINE_DELIMITED_JSON files that need to be loaded into BigQuery.

I wrote a Google Cloud Function that takes the JSON file and loads it into BigQuery. The function works fine with sample JSON files - the problem is that the legacy system generates JSON with a non-standard key:

{
  "id": 12345,
  "@address": "XXXXXX"
  ...
}

Of course, the "@address" key throws everything off and the Cloud Function errors out ...

Is there any option to "ignore" the JSON fields that have non-standard keys? Or to provide a mapping and ignore any JSON field that is not in the map? I looked around to see if I could disable autodetect and provide my own mapping, but the online documentation does not cover this situation.

I am contemplating the option of:

  • Load the file into memory as a string variable
  • Replace @address with address
  • Convert the newline-delimited JSON into a list of dictionaries
  • Use a BigQuery streaming insert to push the rows into BQ

But I'm afraid this will take a lot longer, the file size may exceed the 2 GB memory limit for Cloud Functions, I'd have to deal with Unicode when loading the file into a variable, etc. etc. etc.
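For reference, a rough, untested sketch of that workaround in Python - the bucket/table names and the load_cleaned_file helper are placeholders I made up, not real code from my function:

import json
from google.cloud import bigquery, storage

def load_cleaned_file(bucket_name, blob_name):
    # Download the whole NDJSON file into memory (this is exactly the size concern above).
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    text = blob.download_as_text()

    # Rename non-standard keys, e.g. "@address" -> "address".
    rows = []
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            rows.append({key.lstrip('@'): value for key, value in record.items()})

    # Stream the cleaned rows into BigQuery (assumes the table already exists).
    client = bigquery.Client()
    errors = client.insert_rows_json('my_dataset.legacy_table', rows)  # placeholder table
    if errors:
        raise RuntimeError(errors)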

What other options do I have?

And no, I cannot modify the legacy system to rename the "@address" field :(

Thanks!

2 Comments
  • What exactly is the error you're getting? Instead of kicking off the load job directly from the Cloud Function, it would be prudent to have the Cloud Function kick off a Dataflow pipeline (template) and have that do the load. Commented Jan 12, 2018 at 7:56
  • hey @GrahamPolley sorry, yes, the error is the one you mention in the answer. Let me read the answer and I'll get back to you - thanks! Commented Jan 12, 2018 at 19:27

1 Answer


I'm going to assume the error that you are getting is something like this:

Errors: query: Invalid field name "@address". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.

This error comes from the BigQuery side, because columns/fields in BigQuery have naming restrictions. So, you're going to have to clean your file(s) before loading them into BigQuery.

Here's one way of doing it, which is completely serverless:

  1. Create a Cloud Function to trigger on new files arriving in the bucket. You've already done this part by the sounds of things.
  2. Create a templated Cloud Dataflow pipeline that is triggered by the Cloud Function when a new file arrives. The function simply passes the name of the file to process to the pipeline.
  3. In said Cloud Dataflow pipeline, read the JSON file in a ParDo and, using a JSON parsing library (e.g. Jackson if you are using Java), strip the "@" from the key before creating your output TableRow object (a rough Python sketch follows this list).
  4. Write results to BigQuery. Under the hood, this will actually invoke a BigQuery load job.
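If you end up writing the pipeline in Python rather than Java, the ParDo in step 3 could look roughly like this - a sketch, not production code; the input path and table name are placeholders, the table is assumed to already exist, and a real template would expose the input path as a runtime parameter rather than hard-coding it:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class CleanKeys(beam.DoFn):
    def process(self, line):
        record = json.loads(line)
        # Drop the leading "@" so every field name satisfies BigQuery's rules.
        yield {key.lstrip('@'): value for key, value in record.items()}

def run(argv=None):
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/incoming/file.json')  # placeholder input
         | 'CleanKeys' >> beam.ParDo(CleanKeys())
         | 'Write' >> beam.io.WriteToBigQuery(
               'my-project:my_dataset.legacy_table',  # placeholder table, assumed to exist
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

if __name__ == '__main__':
    run()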

To sum up, you'll need the following in the conga line:

File > GCS > Cloud Function > Dataflow (template) > BigQuery
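For the Cloud Function > Dataflow (template) hop, a minimal launch call in Python via the Dataflow templates API could look something like the sketch below; the project ID, staged template path and the "input" parameter name are assumptions, not fixed names:

from googleapiclient.discovery import build

def on_new_file(data, context):
    # Background Cloud Function triggered by google.storage.object.finalize.
    dataflow = build('dataflow', 'v1b3')
    dataflow.projects().templates().launch(
        projectId='my-project',                                    # assumption
        gcsPath='gs://my-bucket/templates/clean_ndjson_template',  # assumption: where the template was staged
        body={
            'jobName': 'clean-ndjson-job',  # in practice, make this unique per file
            'parameters': {
                'input': 'gs://{}/{}'.format(data['bucket'], data['name']),  # assumed runtime parameter
            },
        },
    ).execute()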

The advantages of this:

  1. Event driven
  2. Scalable
  3. Serverless/no-ops
  4. You get monitoring and alerting out of the box with Stackdriver
  5. Minimal code

See:

Disclosure: the last link is to a blog which was written by one of the engineers I work with.


1 Comment

Thanks Graham - I'm building the pipeline as suggested, although I'm writing the Dataflow in Python and having issues with the templated jobs: stackoverflow.com/q/48308693/399302 - do you happen to know DF for Python?
