
I have an Airflow DAG task that goes into a specific bucket and creates a single external table from all the files inside it. When I try to query the external table in BigQuery, I get the error: (CSV table references column position 19, but line starting at position:325 contains only 19 columns.) I've looked at the specific file it mentions, and the last field in the first row (and in some others) is left blank, when it should be 0 if there is nothing, or a number if there is something. How can I work around this?

Example records from file:

[screenshot of example records omitted]

I've scrolled over so you can see the last columns (column 19 being Total Pieces). All the other records have a number in this field, or at least a 0, but the first record has nothing, as do some others. I imagine this will happen in other files too, so I can't just fix them manually.

create_external_table = BigQueryUpsertTableOperator(
    task_id=f"create_external_{TABLE}_table",
    dataset_id=DATASET,
    project_id=INGESTION_PROJECT_ID,
    table_resource={
        "tableReference": {"tableId": f"{TABLE}_external"},
        "externalDataConfiguration": {
            "sourceFormat": "CSV",
            "allow_quoted_newlines": True,
            "autodetect": True,
            "sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
        },
        "labels": labeler.get_labels_bigquery_table_v2(
            target_project=INGESTION_PROJECT_ID,
            target_dataset=DATASET,
            target_table=f"{TABLE}_external",
        ),
    },
)
  • Hello, are you able to fill the blanks with null values in an upstream task? Can the process writing the file handle this? Commented Oct 9, 2023 at 23:09
  • @Rafa these files are sent from an outside vendor. I could create a PySpark job to open each file and fill all empty values as nulls or 0's, just wanted to see if there were any other options to do so. Commented Oct 10, 2023 at 1:06
  • Without using the operator, you could load it in pandas and use the fillna method to fill the blanks then write the file back to s3, not sure the best method to do this or how many files you have. It is definitely inefficient. Commented Oct 11, 2023 at 19:22
  • There are also docs about handling this without the BigQuery operator. Not sure if these fit your use case cloud.google.com/bigquery/docs/… Commented Oct 11, 2023 at 19:24
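The pandas approach suggested in the comments can be sketched like this. It is a minimal, hypothetical example: the column name `total_pieces` is made up, the data is inlined as a string, and in practice the file would be read from and written back to the bucket rather than an in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical CSV where the last column ("total_pieces") is blank in one row.
raw = "id,total_pieces\n1,\n2,5\n"

# Read the CSV, fill the blank (NaN) values with 0, and write it back out.
df = pd.read_csv(io.StringIO(raw))
df["total_pieces"] = df["total_pieces"].fillna(0).astype(int)
cleaned = df.to_csv(index=False)
```

This is inefficient for many large files, as the comment notes, since every file has to be fully read and rewritten.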

1 Answer


The "allow_jagged_rows" option for the task allows records with missing data in the last column to be ingested fine when creating an external table (or any table, for that matter). My new task looks like the following:

create_external_table = BigQueryUpsertTableOperator(
    task_id=f"create_external_{TABLE}_table",
    dataset_id=DATASET,
    project_id=INGESTION_PROJECT_ID,
    table_resource={
        "tableReference": {"tableId": f"{TABLE}_external"},
        "externalDataConfiguration": {
            "sourceFormat": "CSV",
            "allow_quoted_newlines": True,
            "allow_jagged_rows": True,  # Added option
            "autodetect": True,
            "sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
        },
        "labels": labeler.get_labels_bigquery_table_v2(
            target_project=INGESTION_PROJECT_ID,
            target_dataset=DATASET,
            target_table=f"{TABLE}_external",
        ),
    },
)
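If fixing the files themselves is ever preferred over the table option, the blank trailing fields could also be padded upstream with the standard library's csv module before the files are loaded. This is a minimal sketch, not the answer's method: `EXPECTED_COLS` and the padding value "0" are assumptions based on the question, and real code would stream files from the bucket instead of taking a string.

```python
import csv
import io

EXPECTED_COLS = 20  # columns 0..19; column 19 is "Total Pieces" (assumed)


def pad_rows(text, width=EXPECTED_COLS):
    """Pad any row shorter than `width` with "0" so every record is complete."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        row += ["0"] * (width - len(row))
        writer.writerow(row)
    return out.getvalue()
```

Note that "allow_jagged_rows" reads the missing trailing values as NULL rather than 0, so padding like this is the route to take if downstream queries need an actual 0.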