
I have about 40-odd comma-delimited CSV files in GCS; however, the last line of every file contains only a quote and a dot:

".

So these files don't exactly conform to the CSV schema, and this is a data-quality issue I have to work around.

My aim is to create an external table referencing the GCS files and then be able to select the data.

example:

create or replace external table dataset.tableName
options (
  uris = ['gs://bucket_path/allCSVFILES_*.csv'],
  format = 'CSV',
  skip_leading_rows = 1,
  ignore_unknown_values = true
);

The external table gets created without any error; however, when I select the data, I run into this error:

"error message: CSV table references column position 16, but line starting at position:18628631 contains only 1 columns"

This is due to the quote and dot (".) at the end of each file.

My question is: is there any way in BigQuery to consume the data without the LAST LINE? Among the options we have skip_leading_rows to skip the header, but is there any way to skip the last row?

Currently my best option is to clean the files using the sed/tail commands.
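For reference, that cleanup could look like the sketch below, assuming the files are first copied down locally (e.g. with gsutil); the directory names are placeholders:

```shell
# Copy the files down first, e.g.:
#   gsutil cp 'gs://bucket_path/allCSVFILES_*.csv' .

mkdir -p cleaned
for f in allCSVFILES_*.csv; do
  # sed '$d' deletes the last line of the file
  sed '$d' "$f" > "cleaned/$f"
done

# Then re-upload, e.g.:
#   gsutil cp cleaned/*.csv gs://bucket_path/
```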

I have checked the CREATE OR REPLACE EXTERNAL TABLE options list below and have tried ignore_unknown_values, but other than that I don't see any other option that will work.

https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_external_table_statement


1 Answer


You can try the workaround below:

I used pandas to remove the last record from the CSV file before loading.

from google.cloud import bigquery
import pandas as pd

# Read the CSV from GCS (reading gs:// paths requires the gcsfs package)
df = pd.read_csv('gs://samplecsv.csv')

# Drop the malformed last row
df.drop(df.tail(1).index, inplace=True)

client = bigquery.Client()
dataset_ref = client.dataset('dataset')
table_ref = dataset_ref.table('new_table')

# Load the cleaned DataFrame into BigQuery
client.load_table_from_dataframe(df, table_ref).result()

For more information, you can refer to this link, which mentions the limitations of loading CSV files into BigQuery.


1 Comment

I ended up cleaning the files on GCE using sed '$d'.
