Well-indexed MySQL query takes long to execute. EXPLAIN shows 3m rows. Can it be sped up or improved?

Question

I have a database with over 100 million rows of reddit comment data in the format of:

{
 author: redditauthor1,
 body: example comment,
 subreddit: /r/funny,
 ....
}

I am trying to get a list of users with their respective number of comments for all the subreddits they posted in. I am also narrowing it down by users who also posted in the subreddit I pass through as a parameter.

I have 4 indexes for this single table. Reason being is that I only plan on reading from this for the time being. The indexes look like so:

CREATE INDEX idx_subreddit
ON comments(subreddit);

CREATE INDEX idx_author
ON comments(author);

CREATE INDEX idx_authsub
ON comments(author, subreddit);

CREATE INDEX idx_subauth
ON comments(subreddit, author);

I've also tried just narrowing it down to the subreddit,author index with no improvement. I am further narrowing down my search by removing [deleted] users from the list of rows. My query is as follows:

SELECT author, subreddit, count(*) as numcomments 
from comments
WHERE author IN (SELECT author FROM comments WHERE subreddit="politics"  AND author != "[deleted]")
group by author, subreddit
ORDER BY author
LIMIT 100
;

According to my explain plan, this returns 3 million rows, which is expected of a nearly 100Gb dataset.

The query takes well over 300 seconds to run for large subreddits such as /r/politics. Smaller ones with less activity run in a second or less. Is there anything I can do to improve this execution time? I've tried running the query through EverSQL and using the query they specified as well as the single subreddit,author composite index they recommended but it actually made the runtime worse. I know there are third party options like pushShift API which utilizes google bigquery but because I'd like to work on this offline I want to do it all locally. Lastly, I've thought of just getting all the comments and "counting" them myself instead of using mySql's count(*) method and group by but even so the query takes a while to retrieve all the comments (15 million) that I'd have to process on the back end. Is there a solution to this? Something like a Redis caching system? Partitioning? I wish to get this query under 3 seconds if possible. Any feedback is appreciated.

Per a user's suggestion I have run an explain on this query:


SELECT x.author
     , x.subreddit
     , COUNT(*) numcomments 
  FROM comments x
  JOIN  
     ( SELECT author 
         FROM comments 
        WHERE subreddit = "politics"  
          AND author != "[deleted]"
     ) y
    ON y.author = x.author
 GROUP 
    BY x.author
     , x.subreddit;

and the EXPLAIN produced this:

Why don't you directly put the WHERE condition on the outer query rather than making a sub-query which basically querying the same table? — FanoFN
– FanoFN, Commented Mar 3, 2020 at 5:34
as @tcadidot0 said try without sub-query. it should be same result. SELECT author, subreddit, count(*) as numcomments from comments WHERE subreddit="politics" AND author != "[deleted]" group by author, subreddit LIMIT 100. — Lokesh Jain
– Lokesh Jain, Commented Mar 3, 2020 at 5:40
Sorry, I probably should have specified better in my question but I want a list of ALL the author's posts in EVERY subreddit. By removing the second query I only get a list of the author's posts in the specified subreddit whereas I want the number of comments for every subreddit of the author "who has also posted in '/r/politics'. — RaulT
– RaulT, Commented Mar 3, 2020 at 5:51

drHodge · Accepted Answer · 2020-03-03 05:42:59Z

1

Move the criteria directly in the main query. By adding two selects you are doing at least twice the work. Good luck.

SELECT author, subreddit, count(*) as numcomments 
from comments
WHERE subreddit="politics"  AND author != "[deleted]"
group by author, subreddit
LIMIT 100
;

answered Mar 3, 2020 at 5:42

drHodge

768 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

RaulT Over a year ago

My apologies, I probably didn't do my best to clarify that I don't want JUST the posts in the specified "subreddit" but rather I want all the author's posts who has ALSO posted in the subreddit parameter I'm specifying. Thats why I'm using a subquery with an exists. Maybe a JOIN might fare better?

drHodge Over a year ago

Understood. Perhaps you need to join to comments twice for example. from comments commentA join comments commentB on commentA.author = commentB.author and commentB.subreddit="politics".

Collectives™ on Stack Overflow

Well-indexed MySQL query takes long to execute. EXPLAIN shows 3m rows. Can it be sped up or improved?

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related