I have an Airflow DAG task that goes into a specific bucket and creates a single external table from all the files inside it. When I try to query the external table in BigQuery I get the error: "CSV table references column position 19, but line starting at position:325 contains only 19 columns." I've looked at the specific file it mentions, and the last field in the first row (and in others) is left blank, when it should be 0 if there is nothing, or a number if there is something. How can I work around this?
Example records from the file (screenshot, scrolled over so you can see the last columns): column 19, Total Pieces, is the last column. All the other records have a number in this field, even if it's just 0. The first record has nothing, as do some other records. I can imagine this happening again in other files too, so I can't just fix them manually.
create_external_table = BigQueryUpsertTableOperator(
    task_id=f"create_external_{TABLE}_table",
    dataset_id=DATASET,
    project_id=INGESTION_PROJECT_ID,
    table_resource={
        "tableReference": {"tableId": f"{TABLE}_external"},
        "externalDataConfiguration": {
            "sourceFormat": "CSV",
            "autodetect": True,
            "sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
            # note: in the BigQuery API this option lives under csvOptions
            # and is camelCase, not snake_case
            "csvOptions": {"allowQuotedNewlines": True},
        },
        "labels": labeler.get_labels_bigquery_table_v2(
            target_project=INGESTION_PROJECT_ID,
            target_dataset=DATASET,
            target_table=f"{TABLE}_external",
        ),
    },
)
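If the problem rows are simply missing their trailing column, BigQuery's CSV option `allowJaggedRows` is built for exactly this: short rows are accepted and the missing trailing values are read as NULL (which you could then `COALESCE` to 0 in queries). A minimal sketch of what the `externalDataConfiguration` part of the `table_resource` could look like (placeholder URI; your `f"gs://{ARCHIVE_BUCKET}/..."` string stays as in the question):

```python
# Sketch of an externalDataConfiguration with allowJaggedRows enabled.
# Rows with missing trailing optional columns are treated as valid;
# the absent values come through as NULL instead of raising an error.
external_data_configuration = {
    "sourceFormat": "CSV",
    "autodetect": True,
    "sourceUris": ["gs://my-archive-bucket/my_dataset_data/*.csv"],  # placeholder
    "csvOptions": {
        "allowQuotedNewlines": True,
        "allowJaggedRows": True,
    },
}
```

This keeps the fix entirely inside the table definition, so no files need to be rewritten.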

You could use pandas' fillna method to fill the blanks and then write each file back to the bucket. I'm not sure of the best way to do this, or how many files you have, and it is definitely inefficient compared to fixing it at the table level.
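A rough sketch of that approach, assuming the blanks in the last column should become 0 (the function and variable names here are hypothetical, and the download/upload loop with `google.cloud.storage` is left out):

```python
import io

import pandas as pd


def fill_blank_pieces(csv_bytes: bytes) -> bytes:
    """Fill blank values in the last CSV column with 0 and return new CSV bytes."""
    df = pd.read_csv(io.BytesIO(csv_bytes))
    last_col = df.columns[-1]  # e.g. "Total Pieces"
    # Blank or missing trailing fields are parsed as NaN; replace them with 0.
    df[last_col] = df[last_col].fillna(0).astype(int)
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    return buf.getvalue().encode("utf-8")
```

You would list the blobs in the bucket, download each one, run it through a function like this, and re-upload before creating the external table, which is why it scales poorly with the number of files.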