
I have a custom Airbyte job that fails to normalize the data, so I need to do it manually. The following data is pulled from our HR system:


{
  "title": "My Report", 
  "fields": [{
      "id": "employeeNumber", 
      "name": "Employee #"
    }, 
    {
      "id": "firstName",
      "name": "First Name"
    },
    {
      "id": "lastName",
      "name": "Last Name"
    }],
  "employees": [{
      "employeeNumber": "1234", 
      "firstName": "Ann", 
      "lastName": "Perkins" 
    }, 
    { 
      "employeeNumber": "5678", 
      "firstName": "Bob", 
      "lastName": "Builder" 
    }]
}

My current BigQuery table looks like this (the JSON is stored as a string):

_airbyte_ab_id | _airbyte_emitted_at | _airbyte_data
123abc | 2022-01-30 19:41:59 UTC | {"title": "My Datawareouse", "fields": [ {"id": "employeeNumber", "name": "Employee_Number"}, {"id": "firstName", "name": "First_Name" }, { "id": "lastName", "name": "Last_Name"} ], "employees": [ { "employeeNumber": "1234", "firstName": "Ann", "lastName": "Perkins" }, { "employeeNumber": "5678", "firstName": "Bob", "lastName": "Builder" } ] }

I am trying to normalize the table to look like this:

_airbyte_ab_id | _airbyte_emitted_at | Employee_Number | First_Name | Last_Name
123abc | 2022-01-30 19:41:59 UTC | 1234 | Ann | Perkins
123abc | 2022-01-30 19:41:59 UTC | 5678 | Bob | Builder

How can I flatten the JSON into columns as in the example above, using SQL in BigQuery? (The script will run from dbt, but for now I am just trying to get a valid query to run.)

I should add that the actual JSON has far more fields, they might change, and I expect null values for fields like "Middle Name". So, in a perfect world, I would not have to define each column name, but have it run dynamically by reading the "fields" array.

2 Answers


How can I flatten the JSON into columns as in the example above, using SQL in BigQuery?

Consider the approach below:

select _airbyte_ab_id, _airbyte_emitted_at, 
  json_value(employee, '$.employeeNumber') employeeNumber,
  json_value(employee, '$.firstName') firstName,
  json_value(employee, '$.lastName') lastName
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee         

Applied to the sample data in your question, the output is:

_airbyte_ab_id | _airbyte_emitted_at | employeeNumber | firstName | lastName
123abc | 2022-01-30 19:41:59 UTC | 1234 | Ann | Perkins
123abc | 2022-01-30 19:41:59 UTC | 5678 | Bob | Builder
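Regarding the expected null values: JSON_VALUE simply returns NULL when the path does not exist in the JSON, so absent fields need no special handling. A minimal sketch, assuming a hypothetical middleName field that is not present in the sample data:

select _airbyte_ab_id, _airbyte_emitted_at,
  json_value(employee, '$.employeeNumber') employeeNumber,
  -- middleName is hypothetical; the path is missing for Ann and Bob,
  -- so json_value yields NULL for both rows
  json_value(employee, '$.middleName') middleName
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee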



... in a perfect world, I would not have to define each column name, but have it run dynamically by reading the "Fields" array

For the case where your fields are defined dynamically, and are potentially even different from row to row, I recommend the flattening approach below:

select _airbyte_ab_id, _airbyte_emitted_at, 
  md5(employee) employee_hash,
  json_value(field, "$.id") key,
  regexp_extract(employee, r'"' || json_value(field, "$.id") || '":"(.*?)"') value
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee,
unnest(json_extract_array(_airbyte_data, '$.fields')) field       

Applied to the sample data in your question, the output is:

(Output: one row per employee/field combination — six rows for the sample data — with columns _airbyte_ab_id, _airbyte_emitted_at, employee_hash, key, and value, e.g. key = firstName, value = Ann.)
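If you then need the key/value rows back as one row per employee, one option is a conditional-aggregation pivot. This is only a sketch under the assumption that the field ids are known when the query text is generated — dbt could template the column list from the "fields" array; a fully dynamic column list at query time would need something like EXECUTE IMMEDIATE instead:

-- Sketch: pivot the flattened key/value rows into one row per employee.
-- The column list is hard-coded here; in dbt you could generate these
-- max(if(...)) lines from the "fields" array.
with flattened as (
  select _airbyte_ab_id, _airbyte_emitted_at,
    md5(employee) employee_hash,
    json_value(field, "$.id") key,
    regexp_extract(employee, r'"' || json_value(field, "$.id") || '":"(.*?)"') value
  from your_table,
  unnest(json_extract_array(_airbyte_data, '$.employees')) employee,
  unnest(json_extract_array(_airbyte_data, '$.fields')) field
)
select _airbyte_ab_id, _airbyte_emitted_at,
  max(if(key = 'employeeNumber', value, null)) Employee_Number,
  max(if(key = 'firstName', value, null)) First_Name,
  max(if(key = 'lastName', value, null)) Last_Name
from flattened
group by _airbyte_ab_id, _airbyte_emitted_at, employee_hash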

