I have a database with over 100 million rows of Reddit comment data in the format:
{
    "author": "redditauthor1",
    "body": "example comment",
    "subreddit": "/r/funny",
    ...
}
I am trying to get a list of users with their respective comment counts for every subreddit they posted in, restricted to users who also posted in the subreddit I pass in as a parameter.
I have 4 indexes on this single table, since I only plan on reading from it for the time being. The indexes look like so:
CREATE INDEX idx_subreddit
ON comments(subreddit);
CREATE INDEX idx_author
ON comments(author);
CREATE INDEX idx_authsub
ON comments(author, subreddit);
CREATE INDEX idx_subauth
ON comments(subreddit, author);
I've also tried narrowing it down to just the (subreddit, author) index with no improvement. I am further narrowing the search by removing [deleted] users from the result. My query is as follows:
SELECT author, subreddit, COUNT(*) AS numcomments
FROM comments
WHERE author IN (SELECT author
                 FROM comments
                 WHERE subreddit = 'politics'
                   AND author != '[deleted]')
GROUP BY author, subreddit
ORDER BY author
LIMIT 100;
According to my EXPLAIN plan, this examines about 3 million rows, which is expected for a nearly 100 GB dataset.
The query takes well over 300 seconds for large subreddits such as /r/politics; smaller, less active ones run in a second or less. Is there anything I can do to improve this execution time?

I've tried running the query through EverSQL, using both the query it generated and the single (subreddit, author) composite index it recommended, but that actually made the runtime worse. I know there are third-party options like the Pushshift API, which uses Google BigQuery, but I'd like to work on this offline, so I want to do it all locally.

Lastly, I've thought of fetching all the comments and counting them myself instead of using MySQL's COUNT(*) and GROUP BY, but even then the query takes a while just to retrieve all the comments (15 million) that I'd have to process on the back end.

Is there a solution to this? Something like a Redis caching layer? Partitioning? I'd like to get this query under 3 seconds if possible. Any feedback is appreciated.
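One idea I've considered (only a sketch, not something I've benchmarked on this dataset) is to materialize the matching author list once instead of letting the IN (...) sub-query be re-evaluated against the big table, and then join against it. The temporary-table name politics_authors is made up, and this assumes the (author, subreddit) index idx_authsub from above can then drive the aggregation:

```sql
-- Hypothetical two-step rewrite: build the distinct author list once...
CREATE TEMPORARY TABLE politics_authors (PRIMARY KEY (author))
SELECT DISTINCT author
FROM comments
WHERE subreddit = 'politics'
  AND author != '[deleted]';

-- ...then aggregate, probing the small temp table instead of
-- re-running the sub-query against the 100M-row comments table.
SELECT c.author, c.subreddit, COUNT(*) AS numcomments
FROM politics_authors p
JOIN comments c ON c.author = p.author
GROUP BY c.author, c.subreddit
ORDER BY c.author
LIMIT 100;
```

I don't know whether the optimizer would actually do better with this than with a semijoin of the original query; it's just the "count it myself" idea pushed back into SQL.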
Per a user's suggestion, I have run an EXPLAIN on this query:
SELECT x.author
     , x.subreddit
     , COUNT(*) AS numcomments
FROM comments x
JOIN
   ( SELECT DISTINCT author  -- DISTINCT so the join doesn't multiply the counts
     FROM comments
     WHERE subreddit = 'politics'
       AND author != '[deleted]'
   ) y
  ON y.author = x.author
GROUP
   BY x.author
    , x.subreddit;


Why not put the WHERE condition on the outer query rather than making a sub-query that is basically querying the same table? And how many author values do you get from the sub-query?