I am trying to find the sourcesites that exist ONLY before a certain timestamp. This query performs very poorly for the job. Any ideas on how to optimize it, or on an index that might help?

select distinct sourcesite 
  from contentmeta 
  where timestamp <= '2011-03-15'
  and sourcesite not in (
    select distinct sourcesite 
      from contentmeta 
      where timestamp>'2011-03-15'
  );

There is an index on sourcesite and timestamp, but the query still takes a long time:

mysql> EXPLAIN select distinct sourcesite from contentmeta where timestamp <= '2011-03-15' and sourcesite not in (select distinct sourcesite from contentmeta where timestamp>'2011-03-15');
+----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+
| id | select_type        | table       | type           | possible_keys | key      | key_len | ref  | rows   | Extra                                           |
+----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+
|  1 | PRIMARY            | contentmeta | index          | NULL          | sitetime | 14      | NULL | 725697 | Using where; Using index                        |
|  2 | DEPENDENT SUBQUERY | contentmeta | index_subquery | sitetime      | sitetime | 5       | func |     48 | Using index; Using where; Full scan on NULL key |
+----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+

3 Answers

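The subquery doesn't need the DISTINCT, and the WHERE clause on the outer query is not needed either: the NOT IN already excludes every sourcesite that has a row after the cutoff, so any row that survives must have a timestamp on or before it (assuming timestamp is never NULL).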

Try:

select distinct sourcesite
from contentmeta
where sourcesite not in (
    select sourcesite
    from contentmeta
    where timestamp > '2011-03-15'
);

2 Comments

This can be done without NOT IN, since it is one of the most costly MySQL operations.
@MoyedAnsari: NOT IN is not the "most costly" operation. And this question needs either a NOT IN or NOT EXISTS subquery, or a LEFT JOIN ... IS NULL query.
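
For reference, the NOT EXISTS form mentioned in the comment above could be sketched roughly as follows. This is only an illustration of that option, not something posted in the thread; the aliases c1 and c2 are arbitrary, and the cutoff date is the one from the question.

```sql
-- Sketch: sourcesites that have rows on or before the cutoff
-- and no rows after it, written with NOT EXISTS.
SELECT DISTINCT c1.sourcesite
FROM contentmeta c1
WHERE c1.timestamp <= '2011-03-15'
  AND NOT EXISTS (
    -- correlated check: is there any later row for the same sourcesite?
    SELECT 1
    FROM contentmeta c2
    WHERE c2.sourcesite = c1.sourcesite
      AND c2.timestamp > '2011-03-15'
  );
```

Unlike NOT IN, this form does not return an empty result just because some sourcesite in the subquery is NULL.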

This should work:

SELECT DISTINCT c1.sourcesite
FROM contentmeta c1
LEFT JOIN contentmeta c2
  ON c2.sourcesite = c1.sourcesite
  AND c2.timestamp > '2011-03-15'
WHERE c1.timestamp <= '2011-03-15'
  AND c2.sourcesite IS NULL

For optimum performance, have a multi-column index on contentmeta (sourcesite, timestamp).
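
A minimal sketch of creating such an index, assuming one does not already exist. The index name idx_contentmeta_site_time is made up for illustration; the EXPLAIN output in the question shows a key called sitetime, which may already cover these columns, so it is worth checking first.

```sql
-- List existing indexes before adding a new one
SHOW INDEX FROM contentmeta;

-- Hypothetical index name; only needed if no equivalent
-- (sourcesite, timestamp) index is already present
CREATE INDEX idx_contentmeta_site_time
  ON contentmeta (sourcesite, timestamp);
```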

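In MySQL, joins often perform better than subqueries: derived tables cannot utilize indexes, and a NOT IN subquery may be executed as a dependent subquery, as the EXPLAIN output in the question shows.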


I find that "not in" just doesn't optimize well across many databases. Use a left outer join instead:

select distinct cm.sourcesite
from contentmeta cm
left outer join
(
   select distinct sourcesite
   from contentmeta
   where timestamp > '2011-03-15'
) t
  on cm.sourcesite = t.sourcesite
where cm.timestamp <= '2011-03-15' and t.sourcesite is null

This assumes that sourcesite is never null.
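
One quick way to check that assumption, as a sketch (the column alias null_sites is arbitrary):

```sql
-- If this returns 0, no row has a NULL sourcesite and the
-- left join / IS NULL test above cannot be confused by NULLs
SELECT COUNT(*) AS null_sites
FROM contentmeta
WHERE sourcesite IS NULL;
```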
