
I have a custom Airbyte job that fails to normalize the data, so I need to do it manually. The following data is pulled from our HR system:


{
  "title": "My Report", 
  "fields": [{
      "id": "employeeNumber", 
      "name": "Employee #"
    }, 
    {
      "id": "firstName",
      "name": "First Name"
    },
    {
      "id": "lastName",
      "name": "Last Name"
    }],
  "employees": [{
      "employeeNumber": "1234", 
      "firstName": "Ann", 
      "lastName": "Perkins" 
    }, 
    { 
      "employeeNumber": "5678", 
      "firstName": "Bob", 
      "lastName": "Builder" 
    }]
}

My current BigQuery table looks like this (the JSON is stored as a string):

_airbyte_ab_id | _airbyte_emitted_at | _airbyte_data
123abc | 2022-01-30 19:41:59 UTC | {"title": "My Datawareouse", "fields": [ {"id": "employeeNumber", "name": "Employee_Number"}, {"id": "firstName", "name": "First_Name" }, { "id": "lastName", "name": "Last_Name"} ], "employees": [ { "employeeNumber": "1234", "firstName": "Ann", "lastName": "Perkins" }, { "employeeNumber": "5678", "firstName": "Bob", "lastName": "Builder" } ] }

I am trying to normalize the table to look like this:

_airbyte_ab_id | _airbyte_emitted_at | Employee_Number | First_Name | Last_Name
123abc | 2022-01-30 19:41:59 UTC | 1234 | Ann | Perkins
123abc | 2022-01-30 19:41:59 UTC | 5678 | Bob | Builder

How can I flatten the JSON into columns as in the example above, using SQL in BigQuery? (The script will run from dbt, but for now I am just trying to get a valid query to run.)

I should add that the actual JSON has far more fields, they might change, and I expect null values for fields like "Middle Name". So, in a perfect world, I would not have to define each column name, but have it run dynamically by reading the "fields" array.

2 Answers


How can I flatten the JSON into columns as in the example above, using SQL in BigQuery?

Consider the approach below:

select _airbyte_ab_id, _airbyte_emitted_at, 
  json_value(employee, '$.employeeNumber') employeeNumber,
  json_value(employee, '$.firstName') firstName,
  json_value(employee, '$.lastName') lastName
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee         

Applied to the sample data in your question, the output is:

_airbyte_ab_id | _airbyte_emitted_at | employeeNumber | firstName | lastName
123abc | 2022-01-30 19:41:59 UTC | 1234 | Ann | Perkins
123abc | 2022-01-30 19:41:59 UTC | 5678 | Bob | Builder
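Regarding the expected null values: JSON_VALUE simply returns NULL when the path does not exist in the JSON, so absent fields need no special handling. A minimal sketch, assuming a hypothetical middleName field that is not present in the sample data:

select _airbyte_ab_id, _airbyte_emitted_at,
  json_value(employee, '$.employeeNumber') employeeNumber,
  -- middleName is hypothetical; the path is missing for Ann and Bob,
  -- so json_value yields NULL for both rows
  json_value(employee, '$.middleName') middleName
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee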



... in a perfect world, I would not have to define each column name, but have it run dynamically by reading the "Fields" array

For the case where your fields are defined dynamically, and are potentially even different from row to row, I recommend the flattening approach below:

select _airbyte_ab_id, _airbyte_emitted_at, 
  md5(employee) employee_hash,
  json_value(field, "$.id") key,
  regexp_extract(employee, r'"' || json_value(field, "$.id") || '":"(.*?)"') value
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee,
unnest(json_extract_array(_airbyte_data, '$.fields')) field       

Applied to the sample data in your question, the output is:

(Output: one row per employee/field combination — six rows for the sample data — with columns _airbyte_ab_id, _airbyte_emitted_at, employee_hash, key, and value, e.g. key = firstName, value = Ann.)
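If you then need the key/value rows back as one row per employee, one option is a conditional-aggregation pivot. This is only a sketch under the assumption that the field ids are known when the query text is generated — dbt could template the column list from the "fields" array; a fully dynamic column list at query time would need something like EXECUTE IMMEDIATE instead:

-- Sketch: pivot the flattened key/value rows into one row per employee.
-- The column list is hard-coded here; in dbt you could generate these
-- max(if(...)) lines from the "fields" array.
with flattened as (
  select _airbyte_ab_id, _airbyte_emitted_at,
    md5(employee) employee_hash,
    json_value(field, "$.id") key,
    regexp_extract(employee, r'"' || json_value(field, "$.id") || '":"(.*?)"') value
  from your_table,
  unnest(json_extract_array(_airbyte_data, '$.employees')) employee,
  unnest(json_extract_array(_airbyte_data, '$.fields')) field
)
select _airbyte_ab_id, _airbyte_emitted_at,
  max(if(key = 'employeeNumber', value, null)) Employee_Number,
  max(if(key = 'firstName', value, null)) First_Name,
  max(if(key = 'lastName', value, null)) Last_Name
from flattened
group by _airbyte_ab_id, _airbyte_emitted_at, employee_hash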

