
I have a pandas DataFrame with 3 columns: col1 contains lists, col2 contains dictionaries, and col3 contains NaNs:

import numpy as np
import pandas as pd

dict_ = {'col1': [['abc'], ['def', 'ghi'], []],
         'col2': [{'k1': 'v1', 'k2': 'v2'},
                  {'k1': 'v3', 'k2': 'v4'},
                  {'k1': 'v5', 'k2': 'v6'}],
         'col3': [np.nan, np.nan, np.nan]}
df = pd.DataFrame(dict_)

To upload the DataFrame to BigQuery, I create the following schema for the first and second columns:

schema = [
    bigquery.SchemaField(name="col1", field_type="STRING", mode='REPEATED'),
    bigquery.SchemaField(name="col2", field_type="RECORD", mode='NULLABLE',
                         fields=[bigquery.SchemaField(name="k1", field_type="STRING", mode='NULLABLE'),
                                 bigquery.SchemaField(name="k2", field_type="STRING", mode='NULLABLE')])
]
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE", schema=schema)
job = client.load_table_from_dataframe(df, table, job_config=job_config)
job.result()

The DataFrame was uploaded, but col1 ends up empty in the resulting table.

(Table preview screenshot: col1 shows no values.)

What should I do to fix this?

1 Answer

The load_table_from_dataframe method in the Python client library for BigQuery serializes the DataFrame to Parquet before loading it. Unfortunately, the BigQuery backend has limited support for the array data type when loading Parquet files, which is why col1 comes through empty.
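
If you want to see what actually gets sent, you can build the Arrow table the client would serialize (a minimal sketch, assuming pyarrow is installed, which this code path requires anyway): col1 is inferred as a list column.

import pyarrow as pa

# Roughly what load_table_from_dataframe does before writing Parquet:
# convert the DataFrame to an Arrow table. col1 becomes a list<string>
# column and col2 a struct, while col3 is plain double.
arrow_table = pa.Table.from_pandas(df)
print(arrow_table.schema)
# col1: list<item: string>
# col2: struct<k1: string, k2: string>
# col3: double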

As a workaround, I recommend the insert_rows_from_dataframe method, which streams the rows instead of going through a Parquet load job:

import pandas as pd
import numpy as np
from google.cloud import bigquery


dict_ = {'col1': [['abc'], ['def', 'ghi'], []],
         'col2': [{'k1': 'v1', 'k2': 'v2'},
                  {'k1': 'v3', 'k2': 'v4'},
                  {'k1': 'v5', 'k2': 'v6'}],
         'col3': [np.nan, np.nan, np.nan]}
df = pd.DataFrame(dict_)

client = bigquery.Client()

schema = [
    bigquery.SchemaField(name="col1", field_type="STRING", mode='REPEATED'),
    bigquery.SchemaField(name="col2", field_type="RECORD", mode='NULLABLE',
                     fields=[bigquery.SchemaField(name="k1", field_type="STRING", mode='NULLABLE'),
                             bigquery.SchemaField(name="k2", field_type="STRING", mode='NULLABLE')])
]
table = bigquery.Table(
    "my-project.my_dataset.stackoverflow66054651",
    schema=schema
)
client.create_table(table)

errors = client.insert_rows_from_dataframe(table, df)
for chunk in errors:
    print(f"encountered {len(chunk)} errors: {chunk}")

loaded_df = client.query(
    # Use a query so that data is read from streaming buffer.
    "SELECT * FROM `my-project.my_dataset.stackoverflow66054651`"
).to_dataframe()
print(loaded_df)
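
Note that insert_rows_from_dataframe uses the streaming insert API under the hood, so the rows land in the streaming buffer: they are queryable right away (hence the query above rather than a table preview), but it can take a while before they are available for copy or export jobs.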
