How to iterate through a nested for loop in pandas dataframe?

Question

I am attempting to iterate through a Hacker News dataset and was trying to create 3 categories (i.e types of posts) found on the HN forum viz, ask_posts, show_posts and other_posts.

In short, I am trying to find out the average number of comments per posts per category(described below).

import pandas as pd
import datetime as dt

df = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')

ask_posts = []
show_posts = []
other_post = []
total_ask_comments = 0
total_show_comments = 0

for i, row in df.iterrows():
    title = row.title
    comments = row['num_comments']
    if title.lower().startswith('ask hn'):
        ask_posts.append(title)
        for post in ask_posts:
            total_ask_comments += comments
    elif title.lower().startswith('show hn'):
        show_posts.append(title)
        for post in show_posts:
             total_show_comments += comments
    else:
        other_post.append(title)

avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)


print(total_ask_comments)
print(total_show_comments)

print(avg_ask_comments)
print(avg_show_comments)

The results respectively are;

395976587

250362315

and

43328.21829521829

24646.81187241583

These seem quite high and I am not sure if it because this is an issue with the way I have structured my nested loop. Is this method correct? It is critical that I use a for loop to do this.

Any and all help/verification of my code is appreciated.

"It is critical that I use a for loop to do this" why is that critical? That is definitely not how you should be using pandas. In any case, this seems wrong, you loop through ask_posts and show_posts each time you append to add to the count. It is totally pointless, because you just want the len of each of those lists at the end. — juanpa.arrivillaga
– juanpa.arrivillaga, Commented Apr 25, 2020 at 20:02
your loop where you're adding comments to total_ask_comments doesn't look right. Maybe that shouldn't be a loop. you're counting items in ask_posts that have already been counted in that loop over and over. — Todd
– Todd, Commented Apr 25, 2020 at 20:53
Thank you guys for the comments. I should have mentioned that I am new to python programming (~3 months) so do bear with me and the mistakes. How would you recommend I go about it? — Oscar Agbor
– Oscar Agbor, Commented Apr 25, 2020 at 21:23
Well, I'd recommend that you use filtering to isolate the records that you're interested in. I've put some examples in an answer below. Welcome to Python and pandas programming. You're doing fine for starting out =) — Todd
– Todd, Commented Apr 25, 2020 at 21:52

Todd · Accepted Answer · 2020-04-25 22:19:47Z

This post doesn't answer specifically the question about looping through dataframes; but it gives you an alternative solution which is faster.

Looping over Pandas dataframes to gather the information as you have it is going to be tremendously slow. It's much much faster to use filtering to get the information you want.

>>> show_posts = df[df.title.str.contains("show hn", case=False)]
>>> show_posts
              id  ...       created_at
52      12578335  ...   9/26/2016 0:36
58      12578182  ...   9/26/2016 0:01
64      12578098  ...  9/25/2016 23:44
70      12577991  ...  9/25/2016 23:17
140     12577142  ...  9/25/2016 20:06
...          ...  ...              ...
292995  10177714  ...   9/6/2015 14:21
293002  10177631  ...   9/6/2015 13:50
293019  10177511  ...   9/6/2015 13:02
293028  10177459  ...   9/6/2015 12:38
293037  10177421  ...   9/6/2015 12:16

[10189 rows x 7 columns]
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> ask_posts
              id  ...       created_at
10      12578908  ...   9/26/2016 2:53
42      12578522  ...   9/26/2016 1:17
76      12577908  ...  9/25/2016 22:57
80      12577870  ...  9/25/2016 22:48
102     12577647  ...  9/25/2016 21:50
...          ...  ...              ...
293047  10177359  ...   9/6/2015 11:27
293052  10177317  ...   9/6/2015 10:52
293055  10177309  ...   9/6/2015 10:46
293073  10177200  ...    9/6/2015 9:36
293114  10176919  ...    9/6/2015 6:02

[9147 rows x 7 columns]

You can get your numbers very quickly this way

>>> num_ask_comments = ask_posts.num_comments.sum()
>>> num_ask_comments
95000
>>> num_show_comments = show_posts.num_comments.sum()
>>> num_show_comments
50026
>>> 
>>> total_num_comments = df.num_comments.sum()
>>> total_num_comments
1912761
>>> 
>>> # Get a ratio of the number ask comments to total number of comments
>>> num_ask_comments / total_num_comments
0.04966642460819726
>>>

Also you'll get different numbers with .startswith() vs. .contains() (I'm not sure which you want).

>>> ask_posts = df[df.title.str.lower().str.startswith("ask hn")]
>>> len(ask_posts)
9139
>>> 
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> len(ask_posts)
9147
>>>

The pattern argument to .contains() can be a regular expression - which is very useful. So we can specify all records that begin with "ask hn" at the very start of the title, but if we're not sure if any whitespace will be in front of it, we can do

>>> ask_posts = df[df.title.str.contains(r"^\s*ask hn", case=False)]
>>> len(ask_posts)
9139
>>>

What's happening in the filter statements is probably difficult to grasp when you're starting out using Pandas. The expression in the square brackets of df[df.title.str.contains("show hn", case=False)] for instance.

What the statement inside the square brackets (df.title.str.contains("show hn", case=False)) produces is a column of True and False values - a boolean filter (not sure if that's what it's called but it has that effect).

So that boolean column that's produced is used to select rows in the dataframe, df[<bool column>], and it produces a new dataframe with the matching records. We can then use that to extract other information - like the summation of the comments column.

Wow. I think i've learned more from that comment than all 3 months of my intro course :). I appreciate you taking the time to even show some example. Really means a great deal. I will practice these methods and apply them to my data set. Thank you Todd
You're very welcome @OscarAgbor. It makes me feel good that I can help you out.

Collectives™ on Stack Overflow

How to iterate through a nested for loop in pandas dataframe?

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related