We are looking at implementing schema-on-read to load data into Snowflake tables. We receive .csv files in an AWS S3 path, which is the source for our tables, but the structure of these feed files changes often and we don't want to manually alter an already-created table every time a file's schema changes. We want to automate the entire process of loading a file into its Snowflake table based on how the file's schema changes, without creating a new table every time an attribute is added to or removed from the file. Suggestions on how best to implement this would be really helpful.
3 Answers
There is a feature that allows you to infer the schema of staged files and create a table from it, CREATE TABLE … USING TEMPLATE. From the documentation:
CREATE TABLE … USING TEMPLATE
Creates a new table with the column definitions derived from a set of staged files containing semi-structured data. This feature is currently limited to Apache Parquet, Apache Avro, and ORC files.
...
This example builds on an example in the INFER_SCHEMA topic:
CREATE TABLE mytable
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage',
FILE_FORMAT=>'my_parquet_format'
)
));
Automatically detects the file metadata schema in a set of staged data files that contain semi-structured data and retrieves the column definitions. Use the column definitions to simplify the creation of a landing table or external table to query the data.
This feature is currently limited to Apache Parquet, Apache Avro, and ORC files. Support for JSON and CSV files is currently in preview.
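Since your feeds are CSV, a sketch of the same pattern with a CSV file format might look like the following (stage and format names here are placeholders; for CSV, INFER_SCHEMA needs a file format with PARSE_HEADER = TRUE so column names are read from the header row):

-- Hypothetical names: adjust the stage and format to your environment.
CREATE FILE FORMAT my_csv_format
  TYPE = CSV
  PARSE_HEADER = TRUE;  -- take column names from the CSV header row

CREATE TABLE mytable
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(
      INFER_SCHEMA(
        LOCATION => '@mystage',
        FILE_FORMAT => 'my_csv_format'
      )
    ));

Re-running INFER_SCHEMA after a feed change shows the new column definitions, which you can compare against the existing table to generate ALTER TABLE statements instead of recreating the table.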
2 Comments
One approach could be the following:
- Load the data into a staging table. This staging table could be created with blank headers (col1, col2 ... colX) or on the fly using CREATE TABLE AS SELECT $1, $2 ... $X FROM @stage.
- Create a dynamic query which gets the headers from the staging table and then generates an insert/merge query to load the data into your final table. This can be done using a stored procedure.
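A minimal sketch of that second step, assuming a staging table STG_FEED and a final table FINAL_FEED that already contains (at least) the staging columns — all names here are hypothetical:

-- Sketch only: builds the column list from whatever columns the
-- staging table has today, then runs a dynamic INSERT.
CREATE OR REPLACE PROCEDURE load_from_staging()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  col_list STRING;
  stmt     STRING;
BEGIN
  -- Collect the staging table's current column names in order.
  SELECT LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY ordinal_position)
    INTO :col_list
    FROM information_schema.columns
   WHERE table_name = 'STG_FEED';

  -- Generate and execute the load statement.
  stmt := 'INSERT INTO final_feed (' || col_list || ') ' ||
          'SELECT ' || col_list || ' FROM stg_feed';
  EXECUTE IMMEDIATE :stmt;
  RETURN 'Loaded columns: ' || col_list;
END;
$$;

The same pattern extends to a MERGE, or to comparing the staging columns against the final table's columns and issuing ALTER TABLE ... ADD COLUMN for any new attributes before loading.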
Comments
Snowflake supports using standard SQL to query data files located in an internal (i.e. Snowflake) stage or named external (Amazon S3, Google Cloud Storage, or Microsoft Azure) stage. This can be useful for inspecting/viewing the contents of the staged files, particularly before loading or after unloading data.
In addition, by referencing [metadata columns][1] in a staged file, a staged data query can return additional information, such as filename and row numbers, about the file.
Snowflake utilizes support for staged data queries to enable [transforming data during loading][2].
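For example, a staged-data query like the one below can inspect the files and reshape columns during COPY INTO; the stage, format, and column names are illustrative:

-- Inspect staged CSV files, including metadata columns.
SELECT metadata$filename,
       metadata$file_row_number,
       t.$1, t.$2
  FROM @mystage (FILE_FORMAT => 'my_csv_format') t;

-- Transform/reorder columns while loading.
COPY INTO mytable (filename, col_a, col_b)
  FROM (SELECT metadata$filename, t.$2, t.$1
          FROM @mystage t);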
More details: https://docs.snowflake.com/en/user-guide/querying-stage.html
[1]: https://docs.snowflake.com/en/user-guide/querying-metadata.html
[2]: https://docs.snowflake.com/en/user-guide/data-load-transform.html