
I have a database with over 100 million rows of reddit comment data in the format of:

{
 author: redditauthor1,
 body: example comment,
 subreddit: /r/funny,
 ....
}

I am trying to get a list of users together with their number of comments in every subreddit they have posted in. I restrict the list to users who have also posted in the subreddit I pass in as a parameter.

I have four indexes on this single table, because I only plan to read from it for the time being. The indexes are:

CREATE INDEX idx_subreddit
ON comments(subreddit);

CREATE INDEX idx_author
ON comments(author);

CREATE INDEX idx_authsub
ON comments(author, subreddit);

CREATE INDEX idx_subauth
ON comments(subreddit, author);

I've also tried keeping only the (subreddit, author) index, with no improvement. I further narrow the search by excluding [deleted] users. My query is as follows:

SELECT author, subreddit, COUNT(*) AS numcomments
FROM comments
WHERE author IN (SELECT author FROM comments WHERE subreddit = 'politics' AND author != '[deleted]')
GROUP BY author, subreddit
ORDER BY author
LIMIT 100
;

According to my EXPLAIN plan, this query is estimated to touch about 3 million rows, which is expected for a dataset of nearly 100 GB.

[Image: MySQL EXPLAIN output]

The query takes well over 300 seconds to run for large subreddits such as /r/politics. Smaller ones with less activity run in a second or less. Is there anything I can do to improve this execution time? I've tried running the query through EverSQL and using the query it suggested, along with the single (subreddit, author) composite index it recommended, but that actually made the runtime worse.

I know there are third-party options like the Pushshift API, which uses Google BigQuery, but I'd like to work on this offline, so I want to do it all locally. Lastly, I've thought about fetching all the comments and counting them myself instead of using MySQL's COUNT(*) and GROUP BY, but even then the query takes a while to retrieve all the comments (15 million) that I'd have to process on the back end. Is there a solution to this? Something like a Redis caching system? Partitioning? I'd like to get this query under 3 seconds if possible. Any feedback is appreciated.
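Since the table is read-only for now, the "count them myself" and caching ideas above could also be done inside MySQL with a precomputed summary table. The following is only a sketch under assumptions: the table name author_subreddit_counts, the index name idx_sub_author, and the column types are all hypothetical (the real column definitions are not shown here), and nothing below has been timed against this data.

```sql
-- Hypothetical summary table: one row per (author, subreddit) pair.
-- Column types are assumed; adjust them to match the comments table.
CREATE TABLE author_subreddit_counts (
    author      VARCHAR(64)  NOT NULL,
    subreddit   VARCHAR(64)  NOT NULL,
    numcomments INT UNSIGNED NOT NULL,
    PRIMARY KEY (author, subreddit),
    KEY idx_sub_author (subreddit, author)
);

-- One-off build: this pays the full GROUP BY cost a single time.
INSERT INTO author_subreddit_counts (author, subreddit, numcomments)
SELECT author, subreddit, COUNT(*)
FROM comments
WHERE author <> '[deleted]'
GROUP BY author, subreddit;

-- Per-request lookup: authors who posted in the given subreddit,
-- with their counts in every subreddit. Each (author, subreddit)
-- pair is unique in the summary table, so the self-join cannot
-- duplicate rows.
SELECT c.author, c.subreddit, c.numcomments
FROM author_subreddit_counts AS c
JOIN author_subreddit_counts AS p
  ON p.author = c.author
 AND p.subreddit = 'politics'
ORDER BY c.author
LIMIT 100;
```

The trade-off is that the summary table has to be rebuilt or updated whenever new comments are loaded.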


Per a user's suggestion I have run an explain on this query:


SELECT x.author
     , x.subreddit
     , COUNT(*) numcomments 
  FROM comments x
  JOIN  
     ( SELECT author 
         FROM comments 
        WHERE subreddit = "politics"  
          AND author != "[deleted]"
     ) y
    ON y.author = x.author
 GROUP 
    BY x.author
     , x.subreddit;

and the EXPLAIN produced this: [Image: second EXPLAIN output]
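One thing to watch in the JOIN version: the derived table y holds one row per /r/politics comment, so an author with k comments there has each of their rows matched k times and every count comes out multiplied by k. A minimal sketch of a variant that avoids this, assuming the same comments table, deduplicates the authors before joining:

```sql
SELECT x.author
     , x.subreddit
     , COUNT(*) AS numcomments
  FROM comments AS x
  JOIN
     ( -- DISTINCT leaves one row per author, so the join
       -- no longer repeats each comment row.
       SELECT DISTINCT author
         FROM comments
        WHERE subreddit = 'politics'
          AND author <> '[deleted]'
     ) AS y
    ON y.author = x.author
 GROUP
    BY x.author
     , x.subreddit;
```

This only makes the counts agree with the IN version; it is not a claim about speed.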

  • Why don't you put the WHERE condition directly on the outer query rather than using a sub-query that queries the same table? Commented Mar 3, 2020 at 5:34
  • As @tcadidot0 said, try it without the sub-query; it should give the same result: SELECT author, subreddit, count(*) as numcomments from comments WHERE subreddit="politics" AND author != "[deleted]" group by author, subreddit LIMIT 100. Commented Mar 3, 2020 at 5:40
  • Sorry, I probably should have been clearer in my question, but I want a list of ALL the author's posts in EVERY subreddit. If I remove the second query I only get the author's posts in the specified subreddit, whereas I want the number of comments in every subreddit for each author who has also posted in /r/politics. Commented Mar 3, 2020 at 5:51
  • How many authors do you get from the sub-query? Commented Mar 3, 2020 at 5:53
  • Will do, I'll provide an update with some attempts. Commented Mar 3, 2020 at 6:08

1 Answer


Move the criteria directly into the main query. With two SELECTs you are doing at least twice the work. Good luck.

SELECT author, subreddit, COUNT(*) AS numcomments
FROM comments
WHERE subreddit = 'politics' AND author != '[deleted]'
GROUP BY author, subreddit
LIMIT 100
;

2 Comments

My apologies, I probably didn't make it clear that I don't want JUST the posts in the specified subreddit. I want all the posts of every author who has ALSO posted in the subreddit I pass as a parameter. That's why I'm using a subquery with an EXISTS. Maybe a JOIN would fare better?
Understood. Perhaps you need to join to comments twice, for example: from comments commentA join comments commentB on commentA.author = commentB.author and commentB.subreddit="politics".
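The "subquery with an EXISTS" mentioned in these comments could be sketched as follows. This is an illustration against the same comments table, not a tested fix, and the aliases c and p are made up here. Unlike a plain self-join, EXISTS does not repeat a row once per matching /r/politics comment, so the counts are not inflated.

```sql
SELECT c.author, c.subreddit, COUNT(*) AS numcomments
FROM comments AS c
WHERE c.author <> '[deleted]'
  -- Keep only authors with at least one comment in the target subreddit.
  AND EXISTS (
        SELECT 1
        FROM comments AS p
        WHERE p.subreddit = 'politics'
          AND p.author = c.author
      )
GROUP BY c.author, c.subreddit
ORDER BY c.author
LIMIT 100;
```

The inner lookup filters on subreddit and then author, which matches the column order of the existing idx_subauth index; whether the optimizer actually uses it would need to be checked with EXPLAIN.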
