
I have a Python dictionary that looks like this:

{
'id': 123, 
'categories': [
    {'category': 'fruit', 'values': ['apple', 'banana']}, 
    {'category': 'animal', 'values': ['cat']},
    {'category': 'plant', 'values': []}
  ]
}

I am trying to insert those values into a table in BigQuery via the API using Python; I just need to format the above into an "INSERT table VALUES" query. The table needs to have the fields: id, categories.category, categories.values.

I need categories to basically be an array of structs, each holding a category and that category's corresponding values. The table is supposed to look sort of like the result below in the end, except I need it to be just one row per id, with the corresponding category fields nested and given the proper field names:


SELECT 123 as id, (["fruit"], ["apple", "banana"]) as category
UNION ALL (SELECT 123 as id, (["animal"], ["cat"]) as category)
UNION ALL (SELECT 123 as id, (["plant"], ["tree", "bush", "rose"]) as category)

I'm not really sure how to format the "INSERT" query to get the desired result, can anyone help?
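For what it's worth, a nested INSERT like this can be generated directly from the dictionary. The sketch below builds a DML statement whose categories column is an ARRAY of STRUCTs; the table name and the `dict_to_insert` helper are illustrative, and the plain string building is not injection-safe:

```python
import json

def dict_to_insert(table, record):
    """Render a BigQuery INSERT whose categories column is an ARRAY<STRUCT>."""
    structs = []
    for cat in record['categories']:
        vals = ', '.join(json.dumps(v) for v in cat['values'])
        # Empty arrays need an explicit element type in BigQuery.
        arr = '[{}]'.format(vals) if cat['values'] else 'ARRAY<STRING>[]'
        structs.append('STRUCT({} AS category, {} AS `values`)'.format(
            json.dumps(cat['category']), arr))
    return 'INSERT INTO `{}` (id, categories) VALUES ({}, [{}])'.format(
        table, record['id'], ', '.join(structs))

sql = dict_to_insert('sample.json_mytable', {
    'id': 123,
    'categories': [
        {'category': 'fruit', 'values': ['apple', 'banana']},
        {'category': 'animal', 'values': ['cat']},
        {'category': 'plant', 'values': []},
    ],
})
```

`values` is backtick-quoted because it is a reserved word in standard SQL; the resulting string could then be run with `client.query(sql)`.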

2 Answers


If you want to load a dictionary into BigQuery using Python, you first have to prepare your data. I chose to convert the Python dictionary to a .json file and then load it into BigQuery using the Python API. However, according to the documentation, BigQuery has some limitations regarding loading nested .json data, among them:

  1. Your .json must be newline delimited, which means that each object must be on its own line in the file.
  2. BigQuery does not support maps or dictionaries in JSON. Thus, in order to load the data, you have to wrap it all in [], as you can see here.
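To make the two shapes concrete, here is a minimal standalone sketch of the difference (using a trimmed-down record):

```python
import json

record = {'id': 123, 'categories': [{'category': 'fruit', 'values': ['apple']}]}

# Wrapped form: the whole dataset as one JSON array (what json.dump produces
# when you dump a list) -- load jobs reject this shape.
wrapped = json.dumps([record])

# Newline-delimited form: one complete JSON object per line, no surrounding [].
ndjson = '\n'.join(json.dumps(r) for r in [record])
```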

For this reason, some modifications must be made to the file before you can load it into BigQuery. I have created two scripts: the first converts the Python dict to a JSON file, and the second formats that JSON file as newline-delimited JSON and loads it into BigQuery.

Convert the Python dict to a .json file. Notice that you have to wrap the whole data in []:

import json

py_dict = [{
    'id': 123,
    'categories': [
        {'category': 'fruit', 'values': ['apple', 'banana']},
        {'category': 'animal', 'values': ['cat']},
        {'category': 'plant', 'values': []}
    ]
}]

# Write the wrapped list to a .json file.
with open('json_data.json', 'w') as out_file:
    json.dump(py_dict, out_file, sort_keys=True)

Second, convert the JSON file to newline-delimited JSON and load it into BigQuery:

import json
from google.cloud import bigquery

# Re-read the wrapped list and write one JSON object per line.
with open('json_data.json', 'r') as read_file:
    data = json.load(read_file)
with open('nd-processed.json', 'w') as obj:
    for record in data:
        obj.write(json.dumps(record) + '\n')

client = bigquery.Client()
dataset_id = 'sample'
table_id = 'json_mytable'

dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True

with open('nd-processed.json', 'rb') as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

job.result()  # Waits for the table load to complete.

print("Loaded {} rows into {}:{}.".format(job.output_rows, dataset_id, table_id))
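If autodetect guesses the nested types wrong, the schema can also be pinned down explicitly. Here is a sketch of a matching schema in BigQuery's JSON schema format (the same structure accepted by `bq load --schema`, or translatable into `bigquery.SchemaField` objects set on `job_config.schema` with autodetect off):

```python
import json

# Nested schema matching id / categories.category / categories.values.
schema = [
    {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "categories", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "category", "type": "STRING", "mode": "NULLABLE"},
        {"name": "values", "type": "STRING", "mode": "REPEATED"},
    ]},
]

with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```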

Then, in the BigQuery UI, you can query your table as follows:

SELECT id, categories
FROM `test-proj-261014.sample.json_mytable`, UNNEST(categories) AS categories

And the output (screenshot omitted).



You can use the query below, with your dictionary text embedded into it:

#standardSQL
WITH data AS (
SELECT '''
  {
  'id': 123, 
  'categories': [
      {'category': 'fruit', 'values': ['apple', 'banana']}, 
      {'category': 'animal', 'values': ['cat']},
      {'category': 'plant', 'values': ['tree', 'bush', 'rose']}
    ]
  } 
  ''' dict
)
SELECT 
  JSON_EXTRACT_SCALAR(dict, '$.id') AS id,
  ARRAY(
    SELECT AS STRUCT 
      JSON_EXTRACT_SCALAR(cat, '$.category') AS category,
      ARRAY(
        SELECT TRIM(val, '"')
        FROM UNNEST(JSON_EXTRACT_ARRAY(cat, '$.values')) val
      ) AS `values`
    FROM UNNEST(JSON_EXTRACT_ARRAY(dict, '$.categories')) cat
  ) AS categories 
FROM data   

which produces the result below:

Row id  categories.category categories.values    
1   123 fruit               apple    
                            banana   
        animal              cat  
        plant               tree     
                            bush     
                            rose     

1 Comment

How do I do the query embedding for multiple rows please? The above only recognises one row.
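One way to extend the approach to multiple rows (an untested sketch; the `dicts_to_with_clause` helper is made up) is to generate one SELECT per dictionary and UNION ALL them inside the WITH clause:

```python
import json

def dicts_to_with_clause(dicts):
    """Build a WITH clause exposing one dict-text row per dictionary."""
    selects = ["SELECT '''{}''' AS dict".format(json.dumps(d)) for d in dicts]
    return "WITH data AS (\n{}\n)".format("\nUNION ALL\n".join(selects))

clause = dicts_to_with_clause([
    {'id': 123, 'categories': []},
    {'id': 456, 'categories': []},
])
```

The rest of the query can stay the same, since it already operates on each row of `data`.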
