
I was trying to make use of the MERGE command for populating dimensions in Snowflake. To implement surrogate keys, I created a column defaulted to a sequence number that auto increments whenever a new row gets inserted. I tried a similar approach in other data warehousing platforms and it never caused any issues. However, I noticed that whenever I use the MERGE command in Snowflake, the sequence number increments for every single row processed by the MERGE command, regardless of whether it results in an UPDATE or INSERT operation.

The following is a simple example of what I'm referring to:

-- Sequence
CREATE OR REPLACE SEQUENCE seq1 START=1 INCREMENT=1;

-- Source table
CREATE OR REPLACE TABLE source_table
(
row_key int,
row_value string
);

-- Target table: Column ID uses the sequence
CREATE OR REPLACE TABLE target_table 
(
id int DEFAULT seq1.nextval,
row_key int,
row_value string
);

-- Initial data
INSERT INTO source_table VALUES 
(1,'One'),
(2,'Two'),
(3,'Three');

MERGE INTO target_table D 
USING source_table s 
ON D.row_key=s.row_key
WHEN MATCHED AND D.row_value!=s.row_value THEN UPDATE SET row_value=s.row_value 
WHEN NOT MATCHED THEN INSERT(row_key,row_value) VALUES (s.row_key,s.row_value);

After running these commands, the target table contains these rows:

ID,ROW_KEY,ROW_VALUE
1,1,One
2,2,Two
3,3,Three

Now, let's insert a new row and run the same merge command again:

INSERT INTO source_table VALUES
(4,'Four');

MERGE INTO target_table D 
USING source_table s 
ON D.row_key=s.row_key
WHEN MATCHED AND D.row_value!=s.row_value THEN UPDATE SET row_value=s.row_value 
WHEN NOT MATCHED THEN INSERT(row_key,row_value) VALUES (s.row_key,s.row_value);

This time, the output of the table looks like this:

ID,ROW_KEY,ROW_VALUE
1,1,One
2,2,Two
3,3,Three
7,4,Four

If I insert another row, the next MERGE command inserts it with an ID of 12, and so on. It looks as if the MERGE command increments the sequence for every row it reads from the source table, even for rows that never get inserted into the target table at all.

Is this intentional behaviour? I tried the IDENTITY functionality instead of the sequence and it didn't change the output.

The workaround I came up with was to replace the MERGE command with separate UPDATE and INSERT statements, but I'm still keen to know the reason behind this behaviour.
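For illustration, the two-statement version of my workaround looks roughly like this (a sketch against the tables above; the DEFAULT on id only fires for the INSERT, so matched rows no longer draw values from seq1):

```sql
-- Sketch: replace the MERGE with an UPDATE followed by an INSERT.
-- Only the INSERT triggers the DEFAULT on id, so rows that merely
-- match no longer consume sequence values.
UPDATE target_table d
   SET row_value = s.row_value
  FROM source_table s
 WHERE d.row_key = s.row_key
   AND d.row_value != s.row_value;

INSERT INTO target_table (row_key, row_value)
SELECT s.row_key, s.row_value
  FROM source_table s
 WHERE NOT EXISTS (SELECT 1 FROM target_table d WHERE d.row_key = s.row_key);
```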

1 Comment

As a note, IDENTITY just leverages SEQUENCE objects, so they will not behave differently. Commented Feb 16, 2020 at 16:48

4 Answers


This is a known issue which the Snowflake development team is working on. As you mentioned, the workaround is to replace the MERGE command with separate UPDATE and INSERT statements.


1 Comment

As others have said here, the sequence holes may be regarded as an inconvenience, but not a bug. Plugging the holes would require an additional final synchronization step, where processing could otherwise run completely in parallel. If you need tight sequences, use ROW_NUMBER() or similar.

Per the Snowflake documentation, Snowflake does not guarantee there will be no gaps in sequences. https://docs.snowflake.net/manuals/user-guide/querying-sequences.html.



You probably did this on other transactional databases (Oracle, SQL Server). If you did this on warehousing/analytic databases (like Netezza), you would find similar sequence behavior. These systems are built for speed and bulk processing, so they grab a chunk of sequence values at a time, which they may or may not use. This does leave gaps; but given the maximum value of the sequence and your workload, will you hit the ceiling in 30 or 300 years? Arguably, both are don't-cares.

These analytic databases typically have a higher inherent cost to run any query at all, a cost which is tiny on a transactional database. Transactional databases can therefore get away with asking for a sequence value every single time they need one (no holes!). You can easily see the difference by doing individual inserts, which, as you may already know, are discouraged by Snowflake. Here's a simple test: create a table and run 200 INSERT statements, each inserting a single row. Run this in MySQL on your laptop; then run it on a Medium-sized Snowflake warehouse (or an XS, the point stands either way). MySQL on the laptop will simply crush Snowflake on this particular test, because it is exactly what MySQL is designed to do. There is a massive per-statement overhead for an individual insert, and you'll see how quickly it accumulates even over a small batch of 200 rows.
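The test above can be sketched as a script like this (insert_test is a hypothetical table name; time the full script on each system):

```sql
-- Sketch of the single-row-insert test described above.
-- Each statement is a separate round trip, so per-statement
-- overhead dominates total runtime on Snowflake.
CREATE OR REPLACE TABLE insert_test (n int);
INSERT INTO insert_test VALUES (1);
INSERT INTO insert_test VALUES (2);
-- ... 198 more single-row INSERT statements ...
INSERT INTO insert_test VALUES (200);
```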

Note that MERGE itself is a fairly transactional command, and on these types of databases it has not always been supported. It may or may not be faster to simply do the individual operations yourself; as noted, you would probably still end up with holes between separate runs, but within a single operation you could expect the allocated sequence values to be consecutive, with no gaps.

UPDATE target_table t SET row_value = s.row_value
  FROM source_table s WHERE t.row_key = s.row_key;
INSERT INTO target_table (row_key, row_value)
  SELECT s.row_key, s.row_value FROM source_table s
  WHERE NOT EXISTS (SELECT 1 FROM target_table t WHERE t.row_key = s.row_key);

The update is actually a delete+insert. If you retain sequence values in some kind of sequence-to-business-key map, you might be able to simplify (and maybe speed up?) the process too:

-- assumes a key_map(id, row_key) table holding the sequence-to-key mapping
INSERT INTO key_map (id, row_key)
  SELECT seq1.nextval, s.row_key FROM source_table s
  WHERE NOT EXISTS (SELECT 1 FROM key_map m WHERE m.row_key = s.row_key);
BEGIN;
DELETE FROM target_table USING source_table s WHERE target_table.row_key = s.row_key;
INSERT INTO target_table (id, row_key, row_value)
  SELECT m.id, s.row_key, s.row_value
  FROM source_table s JOIN key_map m ON m.row_key = s.row_key;
COMMIT;

This might be worth considering if the actual update gets a lot uglier. (It would also make for a curious speed experiment.)

1 Comment

I can assure you that Oracle can have holes in its sequences, too. I suppose it depends on the database edition; if you use Enterprise Edition with a parallel server, you may get holes. There is always a sync-related performance penalty for filling the gaps.

From the documentation, Snowflake does not guarantee generating sequence numbers with no gaps. Sequences will wrap around after reaching the maximum positive integer value for the data type.

You could try to use ROW_NUMBER() as a workaround: https://docs.snowflake.net/manuals/sql-reference/functions/row_number.html
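A sketch of that approach, against the asker's tables (an assumption here: loads run one at a time, since concurrent runs would read the same MAX(id)):

```sql
-- Sketch: derive gap-free ids from the current maximum id plus
-- ROW_NUMBER(), instead of a sequence. Safe only for serialized
-- loads; concurrent loads could compute colliding ids.
INSERT INTO target_table (id, row_key, row_value)
SELECT (SELECT COALESCE(MAX(id), 0) FROM target_table)
       + ROW_NUMBER() OVER (ORDER BY s.row_key),
       s.row_key, s.row_value
  FROM source_table s
 WHERE NOT EXISTS (SELECT 1 FROM target_table t WHERE t.row_key = s.row_key);
```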

