Index on column with 70% of empty values: Use null or empty value?

Question

We are currently optimizing a MySQL table (InnoDB) that will eventually have more than 100 million rows.

In one column, we store IP addresses (VARCHAR(45)). We need an index on this column, because we have to be able to retrieve all rows for a specific IP address.

70% of all rows, however, will not store an IP address (empty).

Our question: Should we store those empty values as NULL, and thus allow NULL on this column (which adds 1 byte to each row)? Or should we disallow NULL and store those empty values as '' (empty string)?

What is best for performance?

We will never have to search rows that are empty (= '') or null (IS NULL), only search for specific IP addresses (= '123.456.789.123').

Update: There are indeed many questions on SO that address similar scenarios. However, some answers seem to be contradictory or say "it depends". We will run some tests and post our findings for our specific scenario here.

I would imagine the empty string would be slightly more performant, purely because it uses less storage space. The index would be basically the same either way. The best solution is the one you TEST and verify is quicker.
– Grantly, Commented Dec 19, 2015 at 14:24
@Shadow Yes, seems like a similar question, but at first glance it seems to me the two highest-scoring answers say the opposite? One says "use null", the other one says "don't use null!".
– Mörre, Commented Dec 19, 2015 at 14:32
The 2 highest scoring answers actually don't give a definite yes or no. The 3rd answer is definite about indexing.
– Shadow, Commented Dec 19, 2015 at 14:38
@Lionel 1. INET6_ATON() is available as of MySQL 5.6, and with INET6_NTOA() you can easily convert the numeric form back to a human-readable one. 2. In optimization-related questions you very rarely get straight answers. You are not going to get one here either. The other topic lists all the points you need to consider; then you need to evaluate in your specific environment and with your data what works better. 3. What is more important to you: speed or data storage?
– Shadow, Commented Dec 19, 2015 at 14:51

Rick James · Accepted Answer · 2015-12-19 23:42:06Z

2

VARCHAR(39) is sufficient for both IPv4 (the old format, for which there are no more values available) and IPv6.

The optimizer may screw up if 70% of the values are the same ('' or NULL). I suggest you have another table with the IP and an ID for JOINing back to your original table. By having no 'empty' IPs in the second table, the optimizer is more likely to "do the right thing".

With that, LEFT JOIN can be used to see if there is an IP.
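A minimal sketch of that layout, run against SQLite through Python's `sqlite3` module as a stand-in for MySQL. The table and column names (`events`, `event_ips`, `payload`) are hypothetical and only illustrate the idea of keeping IPs in a side table that contains no empty rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (              -- hypothetical main table, no IP column
        id      INTEGER PRIMARY KEY,
        payload TEXT NOT NULL
    );
    CREATE TABLE event_ips (           -- hypothetical side table, only rows that have an IP
        event_id INTEGER PRIMARY KEY REFERENCES events(id),
        ip       TEXT NOT NULL
    );
    CREATE INDEX idx_event_ips_ip ON event_ips (ip);
""")
conn.executemany("INSERT INTO events (id, payload) VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d")])
conn.executemany("INSERT INTO event_ips (event_id, ip) VALUES (?, ?)",
                 [(2, "192.0.2.10"), (4, "192.0.2.10")])


def events_for_ip(ip):
    """Return the ids of all main-table rows recorded for one IP address."""
    rows = conn.execute(
        "SELECT e.id FROM event_ips AS i "
        "JOIN events AS e ON e.id = i.event_id "
        "WHERE i.ip = ? ORDER BY e.id", (ip,)).fetchall()
    return [r[0] for r in rows]


def events_with_ip_flag():
    """LEFT JOIN from the main table: True where a row has an IP, False otherwise."""
    rows = conn.execute(
        "SELECT e.id, i.ip IS NOT NULL FROM events AS e "
        "LEFT JOIN event_ips AS i ON i.event_id = e.id "
        "ORDER BY e.id").fetchall()
    return [(r[0], bool(r[1])) for r in rows]
```

The index on the side table then covers only rows that actually carry an IP, which is the point of the split. How the MySQL optimizer behaves with either layout still has to be measured on real data.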

IPv6 can be stored in BINARY(16) to save space.
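To illustrate the size difference, here is a small Python sketch that packs an address into its raw bytes and back using the standard `ipaddress` module. It is roughly analogous to what INET6_ATON() and INET6_NTOA() do in MySQL; the helper names `ip_to_binary` and `binary_to_ip` are made up for this example.

```python
import ipaddress


def ip_to_binary(text):
    """Pack an IPv4 or IPv6 address string into raw bytes (4 or 16 bytes)."""
    return ipaddress.ip_address(text).packed


def binary_to_ip(raw):
    """Convert 4 or 16 packed bytes back to the canonical text form."""
    return str(ipaddress.ip_address(raw))


# A fully written-out IPv6 address needs 39 characters as text,
# but only 16 bytes in packed form.
full_text = "2001:0db8:0000:0000:0000:0000:0000:0001"
packed = ip_to_binary(full_text)
```

Note that the round trip returns the compressed canonical spelling (`2001:db8::1` for the address above), not necessarily the exact string that was stored.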

3 Comments

user207421 Over a year ago

If the second table didn't have any empty IP addresses you would have to use null as the foreign key to it, which gets you back where you started.

Rick James Over a year ago

That's an argument against FKs. They aren't useful in all situations.

Arth Over a year ago

@EJP You misunderstood. Rick is suggesting a has-one relationship, where the new table has a reference back to the original. There would be no IP or IP_id column in the original table.

manchicken · Accepted Answer · 2016-03-25 16:56:14Z

1

Go with NULL values. InnoDB has no space cost for NULLs, and NULL values are excluded from indexes, so you'll have a faster index lookup for the values which are present.

As for how you store the IP itself (string versus number), that seems like a far less important point of optimization.

1 Comment

Jonathan Parent Lévesque Over a year ago

Interesting claim, but without proper references, it's hard for me to approve.

Øystein Grøvlen · Accepted Answer · 2015-12-21 12:04:43Z

0

The main difference between NULL and an empty string is related to comparing values. Two empty strings are considered equal. Two NULL values are not. For example, if you want to join two tables based on IP-value columns, the result will be quite different for NULL and empty strings, and most likely you want the behavior of NULL.
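A small demonstration of that difference, using SQLite through Python's `sqlite3` module as a stand-in (its `=` comparison treats NULL the same way standard SQL and MySQL do). The tables `a` and `b` and the helper `join_count` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, ip TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, ip TEXT);
""")


def join_count(missing):
    """Fill both tables with one real IP and two 'missing' values,
    then count the rows produced by joining on the ip column."""
    conn.execute("DELETE FROM a")
    conn.execute("DELETE FROM b")
    rows = [(1, "192.0.2.10"), (2, missing), (3, missing)]
    conn.executemany("INSERT INTO a VALUES (?, ?)", rows)
    conn.executemany("INSERT INTO b VALUES (?, ?)", rows)
    return conn.execute(
        "SELECT COUNT(*) FROM a JOIN b ON a.ip = b.ip").fetchone()[0]


null_matches = join_count(None)   # NULL never equals NULL: only the real IP pairs up
empty_matches = join_count("")    # '' equals '': every empty row matches every other
```

With NULL the join returns 1 row; with empty strings it returns 5 (the real IP plus 2 × 2 pairings of empty values), which is rarely what you want.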

If you are only going to search for specific IP addresses, using NULL or empty string should not matter. If the IP-value column is indexed, the optimizer will obtain an estimate from InnoDB of the number of rows with the specific value. The general statistics on number of rows per value will not be used in this case.

Avoiding NULL values will save you 30 MB on 100 million rows when 70% of the rows are NULL. (For rows where the value is an empty string, you will not save any space since you will need one byte to store the length information instead.) Compared to what you can save by storing IP values as a binary string, this is nothing, and I do not think storage overhead is a valid concern.
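The 30 MB figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes one byte per row for the NULL bit vector when the column is nullable, and one length byte for every stored (non-NULL) VARCHAR value. It ignores the IP text itself, which is the same in both cases, and any index storage.

```python
ROWS = 100_000_000
EMPTY_PERCENT = 70


def overhead_bytes(nullable):
    """Per-table bookkeeping bytes for the IP column under the stated assumptions."""
    filled = ROWS * (100 - EMPTY_PERCENT) // 100   # rows that hold a real IP
    if nullable:
        # 1 byte NULL bit vector on every row,
        # plus 1 length byte only for rows whose value is not NULL
        return ROWS + filled
    # NOT NULL: 1 length byte on every row, including the empty strings
    return ROWS


saving = overhead_bytes(nullable=True) - overhead_bytes(nullable=False)
saving_mb = saving / 1_000_000
```

Under these assumptions the nullable variant costs 130 MB of bookkeeping against 100 MB for the NOT NULL variant, so the difference is the 30 MB mentioned above.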

2 Comments

manchicken Over a year ago

The space cost of NULL values is only relevant in MyISAM. InnoDB has no space cost for NULLs.

Øystein Grøvlen Over a year ago

InnoDB row headers contain a bit vector indicating which columns are NULL. If the table has no nullable columns, the row header will not contain this bit vector. Hence, a table without nullable columns will use 1 byte less per row than the same table with 1-8 nullable columns. See dev.mysql.com/doc/refman/5.7/en/innodb-physical-record.html
