2

I need some options.

I have a table laid out as follows with about 78,000,000 rows...

  • id INT (Primary Key)
  • loc VARCHAR (Indexed)
  • date VARCHAR (Indexed)
  • time VARCHAR
  • ip VARCHAR
  • lookup VARCHAR

Here is an example of a query I have.

SELECT lookup, date, time, count(lookup) as count FROM dnstable
WHERE STR_TO_DATE(`date`, '%d-%b-%Y') >= '$date1' AND STR_TO_DATE(`date`, '%d-%b-%Y')   <= '$date2' AND
time >= '$hour1%' AND time <= '$hour2%' AND
`loc` LIKE '%$prov%' AND
lookup REGEXP 'ca|com|org|net' AND
lookup NOT LIKE '%.arpa' AND
lookup NOT LIKE '%domain.ca' AND 
ip NOT LIKE '192.168.2.1' AND
ip NOT LIKE '192.168.2.2' AND
ip NOT LIKE '192.168.2.3'
GROUP BY lookup
ORDER BY count DESC
LIMIT 100

I have my MySQL server configured like a few high-usage examples I found. The hardware is good: 4 cores, 8 GB of RAM.

This query takes about 180 seconds... Does anyone have some tips on making this more efficient?

4 Answers

3

There are a lot of things wrong here. A LOT of things. I would look to the other answers for query options (you use a lot of LIKEs, NOT LIKEs, and functions, and you're doing them on unkeyed columns). If I were in your case, I'd redesign my entire database. It looks as though you're using this to store DNS entries: host names to IP addresses.

You may not have the option to redesign your database - maybe it's a customer database or something that you don't have control over. Maybe they have a lot of applications which depend on the current database design. However, if you can refactor your database, I would strongly suggest it.

Here's a basic rundown of what I'd do:

  1. Store the TLDs (top-level-domains) in a separate column as an ENUM. Make it an index, so it's easily searchable, instead of trying to regex .com, .arpa, etc. TLDs are limited anyway, and they don't change often, so this is a great candidate for an ENUM.

  2. Store the domain without the TLD in a regular column and a reversed column. You could index both columns, but depending on your searches, you may only need to index the reverse column. Basically, having a reverse column allows you to search for all hosts in one domain (ex. google) without having to do a fulltext search each time. MySQL can do a key search on the string "elgoog" in the reverse column. Because DNS is a hierarchy, this fits perfectly.

  3. Change the date and time columns from VARCHAR to DATE and TIME, respectively. This one's an obvious change. No more STR_TO_DATE or similar conversions in your queries. Absolutely no point in doing that.

  4. Store the IP addresses differently. There's no reason to use a VARCHAR here: it's inefficient and doesn't make sense. Instead, use four separate columns, one per octet (this is safe because all IPv4 addresses have exactly four octets), as unsigned TINYINT values. This will give you 0-255, the range you need. (Each IP octet is 8 bits anyway.) This should make searches much faster, especially if you key the columns.

    ex: SELECT * FROM dnstable WHERE octet1 != 10; (this would filter out all 10.0.0.0/8 private IP space)
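To make the four points above concrete, here is a rough sketch of what such a table could look like. Every name in it (dnstable_v2, log_date, log_time, octet1 through octet4, tld, domain, domain_rev), the column sizes, and the list of ENUM values are assumptions for illustration, not something taken from the original schema.

```sql
-- Hypothetical redesigned table; all names, sizes and ENUM values are illustrative.
CREATE TABLE dnstable_v2 (
    id         INT UNSIGNED     NOT NULL AUTO_INCREMENT,
    loc        VARCHAR(32)      NOT NULL,
    log_date   DATE             NOT NULL,
    log_time   TIME             NOT NULL,
    -- One column per IPv4 octet, each holding 0-255.
    octet1     TINYINT UNSIGNED NOT NULL,
    octet2     TINYINT UNSIGNED NOT NULL,
    octet3     TINYINT UNSIGNED NOT NULL,
    octet4     TINYINT UNSIGNED NOT NULL,
    -- Extend this list to whatever TLDs actually occur in the data.
    tld        ENUM('ca', 'com', 'org', 'net', 'arpa') NOT NULL,
    -- Name without the TLD, e.g. 'www.google'.
    domain     VARCHAR(255) CHARACTER SET ascii NOT NULL,
    -- REVERSE(domain), e.g. 'elgoog.www'.
    domain_rev VARCHAR(255) CHARACTER SET ascii NOT NULL,
    PRIMARY KEY (id),
    KEY idx_log_date (log_date),
    KEY idx_tld (tld),
    KEY idx_domain_rev (domain_rev)
);

-- All hosts under google.com: the pattern has no leading wildcard,
-- so the index on domain_rev is a candidate for a range scan.
SELECT domain, tld, COUNT(*) AS hits
FROM dnstable_v2
WHERE tld = 'com'
  AND (domain_rev = 'elgoog' OR domain_rev LIKE 'elgoog.%')
GROUP BY domain, tld
ORDER BY hits DESC;
```

The explicit 'elgoog.%' pattern (with the dot) is a deliberate choice: a bare 'elgoog%' would also match unrelated names that merely end in "google". Whether the optimizer actually picks idx_domain_rev depends on the data, so it is worth checking with EXPLAIN.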

The basic problem here is that your database design is flawed - and your query is using columns that aren't indexed, and your queries are inefficient.

If you're stuck with the current design....I'm not sure if I can really help you. I'm sorry.


3 Comments

Thank you for the suggestion. I am looking into a db redesign right now. The only problem I am going to have is configuring the data for insert.
Yeah....with 78 million rows, that can't be fun. I wish you the best of luck, though. The IP addresses won't be too bad, as you can split them by delimiter. I don't really know what your lookup column or location column is, though. Is this specifically for DNS name resolution, or is the "loc" column something different entirely?
+1 for the best suggestions here, but I would not recommend indexing the "ip octet" columns, since the maximum possible cardinality (256 + NULL) is too low to be effective with standard B-tree indexes. If selection has to be done only over such octets and over huge amounts of data, I would consider partitioning horizontally. To answer the question about the performance of this specific statement, I think doing point 3 is enough to get results within milliseconds.
2

I bet the really big problem here is the STR_TO_DATE calls. If possible, give the date column a real date datatype (DATE, DATETIME, or TIMESTAMP).

Having this new or altered column (with a date datatype) indexed would speed up selection over that column significantly. You have to avoid the date parsing, which is currently needed only because the 'date' column has the wrong datatype. This parsing/converting prevents MySQL from using the index on the 'date' column.

Conclusion: Give the 'date' column a DATE datatype, index it, and do not use STR_TO_DATE in your statement.
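A minimal sketch of that conversion, assuming a new column is added beside the existing one rather than altering it in place. The names log_date and idx_log_date and the date literals are made up for the example, and it assumes every stored value really matches '%d-%b-%Y'; rows that do not parse would need separate handling, and how they fail depends on the server's SQL mode.

```sql
-- Add a real DATE column next to the existing VARCHAR one
-- (log_date is a hypothetical name).
ALTER TABLE dnstable ADD COLUMN log_date DATE NULL;

-- One-off conversion of the existing strings, e.g. '05-Mar-2011'.
-- On ~78 million rows this is a long-running statement.
UPDATE dnstable
SET log_date = STR_TO_DATE(`date`, '%d-%b-%Y');

-- Index the new column (idx_log_date is a hypothetical name).
ALTER TABLE dnstable ADD INDEX idx_log_date (log_date);

-- The range condition is now on the bare column, with no function
-- wrapped around it, so the index can be considered by the optimizer.
EXPLAIN
SELECT COUNT(*)
FROM dnstable
WHERE log_date >= '2011-03-01'
  AND log_date <= '2011-03-07';
```

On a table this size it may be more practical to run the UPDATE in batches over id ranges; the EXPLAIN at the end is there to confirm that the date range actually uses the new index.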


I assume that these local IP addresses are not very selective when used with negation, right? (This depends on the typical data in the table.) Since the ip column is not indexed, selections on that column always result in a full table scan. If the unequal (<>) selection on ip is very selective, then consider putting an index on it and changing the statement to use <> instead of 'like'. But I do not think that the unequal selection on ip is very selective.

Conclusion: I do not think you can win anything significant here.

4 Comments

These are good, but two more: (1) change REGEXP to an IN clause; (2) if this is your only use of lookup, store reverse(lookup) and index it. When the % is at the end of LIKE the index can be used. (Change the group by to match.) If you also need straight lookup, use a computed column so you can index from both ends. [Some RDBMS can use a functional index on reverse(lookup) but I do not believe MySQL is one of them!]
@Andrew Agreed. I would add (3) don't order by count, as count is defined as an alias and SQL has to order it by lookup, date, time, and count(). If there's a simpler way, choose it. (4) Change the IP 'NOT LIKE' statements to '!=' statements. edit: This was in Ish Kumar's answer, never mind.
@Andrew Lazarus Yep. Quite good. +1 Is it common here that I take suggestions from comments and edit my answer or is it not?
I'm pretty much a noob here too. I don't think I would edit your answer now, as it would make my comment pretty confusing!
0

The problem is that a LIKE will mean a full table traversal, which is why you are seeing this. The first thing I would suggest is to get rid of LIKE '192.168.2.1', since without wildcards that is really the same as = '192.168.2.1'. Also, the LIMIT 100 at the end means that the query will run against all the records and only then select the first 100. How about instead you do a SELECT that involves all the other operations but not LIKE, limit that one, and then have a second SELECT that uses LIKE?

1 Comment

Breaking up the query and limiting the first one probably won't work too well, since not all of the records returned in the first query will meet the criteria of the second query.
0

Some Tips

  • Use != instead of NOT LIKE
  • Avoid REGEXP in MySQL queries
  • Avoid STR_TO_DATE(date, '%d-%b-%Y') >= '$date1'; try passing a MySQL-formatted date to the query rather than converting values with STR_TO_DATE
  • lookup should be indexed if you have to use group by on it.

Try caching the query results (if possible).
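As a sketch of how the tips above might be applied to the query from the question: the version below assumes the date and time values are available as real DATE and TIME columns (log_date and log_time are hypothetical names) and uses placeholder literals where the original had PHP variables. REGEXP is left in place, because replacing it cleanly needs a schema change such as a separate TLD column.

```sql
-- Assumes log_date (DATE, indexed) and log_time (TIME) exist;
-- the original table stores both as VARCHAR.
-- All literal values below are placeholders for $date1, $date2, etc.
SELECT lookup, COUNT(lookup) AS hits
FROM dnstable
WHERE log_date BETWEEN '2011-03-01' AND '2011-03-07'
  AND log_time BETWEEN '08:00:00' AND '17:59:59'
  AND loc LIKE '%ON%'
  AND lookup REGEXP 'ca|com|org|net'
  AND lookup NOT LIKE '%.arpa'
  AND lookup NOT LIKE '%domain.ca'
  -- Plain inequality instead of three NOT LIKE patterns without wildcards.
  AND ip NOT IN ('192.168.2.1', '192.168.2.2', '192.168.2.3')
GROUP BY lookup
ORDER BY hits DESC
LIMIT 100;

-- Optional: an index on the grouping column, as the tip suggests
-- (idx_lookup is a hypothetical name).
ALTER TABLE dnstable ADD INDEX idx_lookup (lookup);
```

The date and time columns are dropped from the SELECT list here because, with GROUP BY lookup alone, their values are not well defined per group and the query is rejected when ONLY_FULL_GROUP_BY is enabled.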

1 Comment

-1. Some of the suggestions are good, namely, != for NOT LIKE and avoid STR_TO_DATE. But the other suggestions are weird. There is no problem with order by and group by in the same query; the RDBMS knows how to combine them. And how can you skip count(lookup) if that's part of the data he needs to retrieve?! And I don't see any gain to moving logic to the application. DBs are good at filtering and sorting. That's what they are built for.
