2

We are looking at implementing schema-on-read to load data into Snowflake tables. We receive .csv feed files in an AWS S3 path, which will be the source for our tables. The structure of these feed files changes often, and we don't want to manually alter an already created table every time a file's schema changes. We want to automate the entire process of loading a file into its Snowflake table based on how the file's schema changes, without creating a new table every time an attribute is added to or removed from the file. Any suggestions on how best to implement this would be really helpful.

3 Answers

2

There is a feature that infers the schema of staged files and creates a table from it:

CREATE TABLE … USING TEMPLATE:

Creates a new table with the column definitions derived from a set of staged files containing semi-structured data. This feature is currently limited to Apache Parquet, Apache Avro, and ORC files.

...

This example builds on an example in the INFER_SCHEMA topic:

CREATE TABLE mytable
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
      FROM TABLE(
        INFER_SCHEMA(
          LOCATION=>'@mystage',
          FILE_FORMAT=>'my_parquet_format'
        )
      ));

INFER_SCHEMA:

Automatically detects the file metadata schema in a set of staged data files that contain semi-structured data and retrieves the column definitions. Use the column definitions to simplify the creation of a landing table or external table to query the data.

This feature is currently limited to Apache Parquet, Apache Avro, and ORC files. Support for JSON and CSV files is currently in preview.
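As a rough sketch of what the preview CSV support looks like (the stage and file format names here are placeholders, and the exact syntax may change while the feature is in preview), the same INFER_SCHEMA call can be pointed at CSV files if the file format parses the header row:

```sql
-- Placeholder names; CSV support for INFER_SCHEMA is in preview
CREATE FILE FORMAT my_csv_format
  TYPE = CSV
  PARSE_HEADER = TRUE;   -- take column names from the header row

-- Inspect the inferred column definitions before creating a table
SELECT *
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION=>'@mystage',
      FILE_FORMAT=>'my_csv_format'
    )
  );
```

The result set of INFER_SCHEMA can then feed the same `CREATE TABLE … USING TEMPLATE` pattern shown above.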


2 Comments

I'm afraid he needs this for structured data (.csv files), not for semi-structured data.
@CMe I wrote this answer with the question's schema-on-read topic in mind. This functionality is in preview and still evolving, so CSV may be added as well. But you are right: as of the time of writing, it does not support CSV.
0

One approach could be the following:

  • Load the data into a staging table. This staging table could be created with generic headers (col1, col2 ... colX) or on the fly using CREATE TABLE AS SELECT $1, $2 ... $X FROM @stage.
  • Create a dynamic query which reads the headers from the staging table and then generates an INSERT/MERGE query to load the data into your final table. This can be done with a stored procedure.
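A minimal sketch of such a procedure, assuming the staging and final tables share column names (the procedure name, argument names, and table names are all placeholders, and a real version would need quoting/validation of identifiers):

```sql
-- Sketch only: builds an INSERT from whatever columns the staging table has
CREATE OR REPLACE PROCEDURE load_dynamic(STAGING STRING, TARGET STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  // Read the staging table's column list from INFORMATION_SCHEMA
  var rs = snowflake.execute({sqlText:
    `SELECT LISTAGG(column_name, ',') WITHIN GROUP (ORDER BY ordinal_position)
       FROM information_schema.columns
      WHERE table_name = ?`,
    binds: [STAGING]});
  rs.next();
  var cols = rs.getColumnValue(1);

  // Insert only the columns present in this feed; columns missing from the
  // staging table are left to their defaults/NULL in the target table
  snowflake.execute({sqlText:
    `INSERT INTO ${TARGET} (${cols}) SELECT ${cols} FROM ${STAGING}`});
  return 'Loaded columns: ' + cols;
$$;
```

The same idea extends to a MERGE if the final table needs upserts rather than plain inserts.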


0

Snowflake supports using standard SQL to query data files located in an internal (i.e. Snowflake) stage or named external (Amazon S3, Google Cloud Storage, or Microsoft Azure) stage. This can be useful for inspecting/viewing the contents of the staged files, particularly before loading or after unloading data.

In addition, by referencing [metadata columns][1] in a staged file, a staged data query can return additional information, such as filename and row numbers, about the file.

Snowflake utilizes support for staged data queries to enable [transforming data during loading][2].
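For example, a COPY statement can select CSV columns by position, reorder them, and capture file metadata while loading (the table, stage, and column names here are placeholders):

```sql
-- Placeholder names; $1/$2 reference CSV columns by position
COPY INTO mytable (col_a, col_b, source_file, source_row)
FROM (
  SELECT $1, $2, METADATA$FILENAME, METADATA$FILE_ROW_NUMBER
    FROM @mystage
)
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```

Recording METADATA$FILENAME alongside the data also makes it easier to trace which feed file a given row came from after the schema changes.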

More details: https://docs.snowflake.com/en/user-guide/querying-stage.html
[1]: https://docs.snowflake.com/en/user-guide/querying-metadata.html
[2]: https://docs.snowflake.com/en/user-guide/data-load-transform.html

