This post doesn't specifically answer the question about looping through dataframes, but it gives you an alternative solution that is faster.
Looping over a Pandas dataframe row by row to gather this information is going to be tremendously slow. It's much, much faster to use filtering to get the information you want.
>>> show_posts = df[df.title.str.contains("show hn", case=False)]
>>> show_posts
id ... created_at
52 12578335 ... 9/26/2016 0:36
58 12578182 ... 9/26/2016 0:01
64 12578098 ... 9/25/2016 23:44
70 12577991 ... 9/25/2016 23:17
140 12577142 ... 9/25/2016 20:06
... ... ... ...
292995 10177714 ... 9/6/2015 14:21
293002 10177631 ... 9/6/2015 13:50
293019 10177511 ... 9/6/2015 13:02
293028 10177459 ... 9/6/2015 12:38
293037 10177421 ... 9/6/2015 12:16
[10189 rows x 7 columns]
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> ask_posts
id ... created_at
10 12578908 ... 9/26/2016 2:53
42 12578522 ... 9/26/2016 1:17
76 12577908 ... 9/25/2016 22:57
80 12577870 ... 9/25/2016 22:48
102 12577647 ... 9/25/2016 21:50
... ... ... ...
293047 10177359 ... 9/6/2015 11:27
293052 10177317 ... 9/6/2015 10:52
293055 10177309 ... 9/6/2015 10:46
293073 10177200 ... 9/6/2015 9:36
293114 10176919 ... 9/6/2015 6:02
[9147 rows x 7 columns]
Once you have the filtered dataframes, you can get your numbers very quickly by summing the num_comments column:
>>> num_ask_comments = ask_posts.num_comments.sum()
>>> num_ask_comments
95000
>>> num_show_comments = show_posts.num_comments.sum()
>>> num_show_comments
50026
>>>
>>> total_num_comments = df.num_comments.sum()
>>> total_num_comments
1912761
>>>
>>> # Get the ratio of the number of ask comments to the total number of comments
>>> num_ask_comments / total_num_comments
0.04966642460819726
>>>
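If you want to try this without the full dataset, here is a minimal sketch on a made-up dataframe. The titles and comment counts are invented for illustration; only the column names title and num_comments come from the session above. It also runs the row-by-row loop once, just to show that both approaches agree on the total.

```python
import pandas as pd

# Hypothetical toy data standing in for the real posts dataframe.
df = pd.DataFrame(
    {
        "title": [
            "Ask HN: How do I learn pandas?",
            "Show HN: My side project",
            "A plain story",
            "ask hn: best editor?",
        ],
        "num_comments": [10, 4, 6, 5],
    }
)

# Vectorized filtering: one boolean test over the whole column at once.
ask_posts = df[df.title.str.contains("ask hn", case=False)]
show_posts = df[df.title.str.contains("show hn", case=False)]

num_ask_comments = ask_posts.num_comments.sum()
num_show_comments = show_posts.num_comments.sum()
total_num_comments = df.num_comments.sum()
ask_ratio = num_ask_comments / total_num_comments

# The equivalent explicit loop, for comparison only.
loop_ask_comments = 0
for _, row in df.iterrows():
    if "ask hn" in row["title"].lower():
        loop_ask_comments += row["num_comments"]

print(num_ask_comments, num_show_comments, total_num_comments, ask_ratio)
print(loop_ask_comments == num_ask_comments)
```

On four rows the speed difference is invisible, but the filtered version stays a single expression no matter how many rows there are.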
Also, you'll get different numbers with .startswith() vs. .contains() (I'm not sure which you want):
>>> ask_posts = df[df.title.str.lower().str.startswith("ask hn")]
>>> len(ask_posts)
9139
>>>
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> len(ask_posts)
9147
>>>
The pattern argument to .contains() can be a regular expression, which is very useful. For example, to select all records whose title begins with "ask hn", while allowing for any whitespace that might be in front of it, we can do:
>>> ask_posts = df[df.title.str.contains(r"^\s*ask hn", case=False)]
>>> len(ask_posts)
9139
>>>
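To see why the three variants can disagree, here is a small sketch on a handful of made-up titles (they are assumptions for illustration, not rows from the real data). One title mentions "Ask HN" in the middle, and one has leading whitespace.

```python
import pandas as pd

# Hypothetical titles chosen to exercise each matching rule.
titles = pd.Series(
    [
        "Ask HN: What are you reading?",   # starts with the phrase
        "  ask hn: leading spaces",        # phrase after leading whitespace
        "Why I never post to Ask HN",      # phrase in the middle of the title
        "Show HN: A tool",                 # no match at all
    ]
)

# Matches the phrase anywhere in the title.
n_contains = titles.str.contains("ask hn", case=False).sum()

# Matches only titles whose very first characters are the phrase.
n_startswith = titles.str.lower().str.startswith("ask hn").sum()

# Regex: phrase at the start, optionally preceded by whitespace.
n_regex = titles.str.contains(r"^\s*ask hn", case=False).sum()

print(n_contains, n_startswith, n_regex)
```

In the real data above, .startswith() and the regex both gave 9139, which suggests no titles there have leading whitespace; the toy data only differs because I put one in on purpose.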
What's happening in the filter statements is probably difficult to grasp when you're starting out with Pandas. Take, for instance, the expression in the square brackets of df[df.title.str.contains("show hn", case=False)].
What the statement inside the square brackets (df.title.str.contains("show hn", case=False)) produces is a column of True and False values, usually called a boolean mask.
That boolean column is then used to select rows in the dataframe, df[<bool column>], which produces a new dataframe containing only the matching records. We can then use that to extract other information, like the sum of the num_comments column.
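It can help to look at the boolean column on its own before using it as a filter. The sketch below uses a made-up four-row dataframe (the data is an assumption for illustration). It also passes na=False, which tells .contains() to treat missing titles as non-matches; whether the real data has missing titles is not something I've checked.

```python
import pandas as pd

# Hypothetical toy data, including one missing title.
df = pd.DataFrame(
    {
        "title": ["Show HN: A", "Ask HN: B", None, "show hn: C"],
        "num_comments": [3, 7, 1, 2],
    }
)

# Step 1: build the boolean column (one True/False per row).
show_mask = df.title.str.contains("show hn", case=False, na=False)
print(show_mask.tolist())

# Step 2: use it inside the square brackets to keep only the True rows.
show_posts = df[show_mask]
show_comments = show_posts.num_comments.sum()

# Masks can be combined with | (or), & (and) and ~ (not).
ask_mask = df.title.str.contains("ask hn", case=False, na=False)
ask_or_show = df[ask_mask | show_mask]

print(show_comments, len(ask_or_show))
```

Storing the mask in a variable like this is optional; df[df.title.str.contains(...)] does the same two steps in one line.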
In any case, this seems wrong: you loop through ask_posts and show_posts each time you append, just to add to the count. It is totally pointless, because you just want the len of each of those lists at the end. Adding comments to total_ask_comments doesn't look right either. Maybe that shouldn't be a loop; you're counting items in ask_posts that have already been counted in that loop, over and over.