
I have an Airflow DAG task that goes into a specific bucket and creates a single external table from all the files inside it. When I try to query the external table in BigQuery, I get the error: (CSV table references column position 19, but line starting at position:325 contains only 19 columns.) I've looked at the specific file it mentions, and the last field in the first row (and in some others) is left blank, when it should be 0 if there is nothing, or a number if there is something. How can I work around this?

Example records from file:

[screenshot of example records omitted]

I've scrolled over so you can see the last columns (column 19 being Total Pieces). All the other records have a number in this field, or at least a 0, but the first record has nothing, as do some others. I imagine this will happen in other files too, so I can't just fix them manually.

create_external_table = BigQueryUpsertTableOperator(
    task_id=f"create_external_{TABLE}_table",
    dataset_id=DATASET,
    project_id=INGESTION_PROJECT_ID,
    table_resource={
        "tableReference": {"tableId": f"{TABLE}_external"},
        "externalDataConfiguration": {
            "sourceFormat": "CSV",
            "allow_quoted_newlines": True,
            "autodetect": True,
            "sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
        },
        "labels": labeler.get_labels_bigquery_table_v2(
            target_project=INGESTION_PROJECT_ID,
            target_dataset=DATASET,
            target_table=f"{TABLE}_external",
        ),
    },
)
  • Hello, are you able to fill the blanks with null values in an upstream task? Can the process writing the file handle this? Commented Oct 9, 2023 at 23:09
  • @Rafa these files are sent from an outside vendor. I could create a PySpark job to open each file and fill all empty values as nulls or 0's, just wanted to see if there were any other options to do so. Commented Oct 10, 2023 at 1:06
  • Without using the operator, you could load it in pandas and use the fillna method to fill the blanks then write the file back to s3, not sure the best method to do this or how many files you have. It is definitely inefficient. Commented Oct 11, 2023 at 19:22
  • There are also docs about handling this without the BigQuery operator. Not sure if these fit your use case cloud.google.com/bigquery/docs/… Commented Oct 11, 2023 at 19:24
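The pandas approach suggested in the comments can be sketched like this. It is a minimal, hypothetical example: the column name `total_pieces` is made up, the data is inlined as a string, and in practice the file would be read from and written back to the bucket rather than an in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical CSV where the last column ("total_pieces") is blank in one row.
raw = "id,total_pieces\n1,\n2,5\n"

# Read the CSV, fill the blank (NaN) values with 0, and write it back out.
df = pd.read_csv(io.StringIO(raw))
df["total_pieces"] = df["total_pieces"].fillna(0).astype(int)
cleaned = df.to_csv(index=False)
```

This is inefficient for many large files, as the comment notes, since every file has to be fully read and rewritten.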

1 Answer


The "allow_jagged_rows" option for the task allows records with missing data in the last column to be ingested fine when creating an external table (or any table, for that matter). My new task looks like the following:

create_external_table = BigQueryUpsertTableOperator(
    task_id=f"create_external_{TABLE}_table",
    dataset_id=DATASET,
    project_id=INGESTION_PROJECT_ID,
    table_resource={
        "tableReference": {"tableId": f"{TABLE}_external"},
        "externalDataConfiguration": {
            "sourceFormat": "CSV",
            "allow_quoted_newlines": True,
            "allow_jagged_rows": True,  # Added option
            "autodetect": True,
            "sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
        },
        "labels": labeler.get_labels_bigquery_table_v2(
            target_project=INGESTION_PROJECT_ID,
            target_dataset=DATASET,
            target_table=f"{TABLE}_external",
        ),
    },
)
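If fixing the files themselves is ever preferred over the table option, the blank trailing fields could also be padded upstream with the standard library's csv module before the files are loaded. This is a minimal sketch, not the answer's method: `EXPECTED_COLS` and the padding value "0" are assumptions based on the question, and real code would stream files from the bucket instead of taking a string.

```python
import csv
import io

EXPECTED_COLS = 20  # columns 0..19; column 19 is "Total Pieces" (assumed)


def pad_rows(text, width=EXPECTED_COLS):
    """Pad any row shorter than `width` with "0" so every record is complete."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        row += ["0"] * (width - len(row))
        writer.writerow(row)
    return out.getvalue()
```

Note that "allow_jagged_rows" reads the missing trailing values as NULL rather than 0, so padding like this is the route to take if downstream queries need an actual 0.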