Python - speed up Pandas iteration

Question

Sorry for my bad english.

This is a simplified version of my dataframe:

d = {'League': {5697: 'Premier League', 5695: 'Premier League', 5694: 'Premier League', 5693: 'Premier League', 5692: 'Premier League', 5691: 'Premier League', 5696: 'Premier League', 5689: 'Premier League', 5688: 'Premier League', 5690: 'Premier League', 5684: 'Premier League', 5680: 'Premier League', 5681: 'Premier League', 5682: 'Premier League', 5686: 'Premier League', 5685: 'Premier League', 5687: 'Premier League', 5683: 'Premier League', 5678: 'Premier League', 5679: 'Premier League', 5677: 'Premier League', 5674: 'Premier League', 5676: 'Premier League', 5675: 'Premier League', 5673: 'Premier League', 5670: 'Premier League', 5672: 'Premier League', 5671: 'Premier League', 5665: 'Premier League', 5667: 'Premier League', 5666: 'Premier League', 5669: 'Premier League', 5664: 'Premier League', 5663: 'Premier League', 5662: 'Premier League', 5668: 'Premier League', 5661: 'Premier League', 5660: 'Premier League', 5654: 'Premier League', 5653: 'Premier League', 5655: 'Premier League', 5656: 'Premier League', 5657: 'Premier League', 5658: 'Premier League', 5659: 'Premier League', 5652: 'Premier League', 5651: 'Premier League', 5650: 'Premier League', 5649: 'Premier League', 5642: 'Premier League'}, 'Date': {5697: '2003-08-16 13:30', 5695: '2003-08-16 16:00', 5694: '2003-08-16 16:00', 5693: '2003-08-16 16:00', 5692: '2003-08-16 16:00', 5691: '2003-08-16 16:00', 5696: '2003-08-16 16:00', 5689: '2003-08-17 15:00', 5688: '2003-08-17 15:00', 5690: '2003-08-17 17:05', 5684: '2003-08-23 13:00', 5680: '2003-08-23 16:00', 5681: '2003-08-23 16:00', 5682: '2003-08-23 16:00', 5686: '2003-08-23 16:00', 5685: '2003-08-23 16:00', 5687: '2003-08-23 16:00', 5683: '2003-08-23 16:00', 5678: '2003-08-24 15:00', 5679: '2003-08-24 17:05', 5677: '2003-08-25 21:00', 5674: '2003-08-26 20:45', 5676: '2003-08-26 20:45', 5675: '2003-08-26 20:45', 5673: '2003-08-26 21:00', 5670: '2003-08-27 20:45', 5672: '2003-08-27 21:00', 5671: '2003-08-27 21:00', 5665: '2003-08-30 13:30', 5667: '2003-08-30 16:00', 5666: '2003-08-30 16:00', 5669: '2003-08-30 16:00', 5664: '2003-08-30 16:00', 5663: '2003-08-30 16:00', 5662: '2003-08-30 16:00', 5668: '2003-08-30 16:00', 5661: '2003-08-31 15:00', 5660: '2003-08-31 17:05', 5654: '2003-09-13 16:00', 5653: '2003-09-13 16:00', 5655: '2003-09-13 16:00', 5656: '2003-09-13 16:00', 5657: '2003-09-13 16:00', 5658: '2003-09-13 16:00', 5659: '2003-09-13 16:00', 5652: '2003-09-14 15:00', 5651: '2003-09-14 17:05', 5650: '2003-09-15 21:00', 5649: '2003-09-20 13:30', 5642: '2003-09-20 16:00'}, 'HomeTeam': {5697: 'Portsmouth', 5695: 'Leicester', 5694: 'Fulham', 5693: 'Blackburn', 5692: 'Birmingham', 5691: 'Arsenal', 5696: 'Manchester United', 5689: 'Leeds', 5688: 'Charlton', 5690: 'Liverpool', 5684: 'Newcastle', 5680: 'Bolton', 5681: 'Chelsea', 5682: 'Everton', 5686: 'Tottenham', 5685: 'Southampton', 5687: 'Wolves', 5683: 'Manchester City', 5678: 'Aston Villa', 5679: 'Middlesbrough', 5677: 'Blackburn', 5674: 'Leeds', 5676: 'Portsmouth', 5675: 'Leicester', 5673: 'Charlton', 5670: 'Arsenal', 5672: 'Manchester United', 5671: 'Liverpool', 5665: 'Everton', 5667: 'Newcastle', 5666: 'Middlesbrough', 5669: 'Wolves', 5664: 'Chelsea', 5663: 'Bolton', 5662: 'Aston Villa', 5668: 'Tottenham', 5661: 'Southampton', 5660: 'Manchester City', 5654: 'Blackburn', 5653: 'Arsenal', 5655: 'Bolton', 5656: 'Charlton', 5657: 'Chelsea', 5658: 'Everton', 5659: 'Southampton', 5652: 'Manchester City', 5651: 'Birmingham', 5650: 'Leicester', 5649: 'Wolves', 5642: 'Aston Villa'}, 'AwayTeam': {5697: 'Aston Villa', 5695: 'Southampton', 5694: 'Middlesbrough', 5693: 'Wolves', 5692: 'Tottenham', 5691: 'Everton', 5696: 'Bolton', 5689: 'Newcastle', 5688: 'Manchester City', 5690: 'Chelsea', 5684: 'Manchester United', 5680: 'Blackburn', 5681: 'Leicester', 5682: 'Fulham', 5686: 'Leeds', 5685: 'Birmingham', 5687: 'Charlton', 5683: 'Portsmouth', 5678: 'Liverpool', 5679: 'Arsenal', 5677: 'Manchester City', 5674: 'Southampton', 5676: 'Bolton', 5675: 'Middlesbrough', 5673: 'Everton', 5670: 'Aston Villa', 5672: 'Wolves', 5671: 'Tottenham', 5665: 'Liverpool', 5667: 'Birmingham', 5666: 'Leeds', 5669: 'Portsmouth', 5664: 'Blackburn', 5663: 'Charlton', 5662: 'Leicester', 5668: 'Fulham', 5661: 'Manchester United', 5660: 'Arsenal', 5654: 'Liverpool', 5653: 'Portsmouth', 5655: 'Middlesbrough', 5656: 'Manchester United', 5657: 'Tottenham', 5658: 'Newcastle', 5659: 'Wolves', 5652: 'Aston Villa', 5651: 'Fulham', 5650: 'Leeds', 5649: 'Chelsea', 5642: 'Charlton'}, 'GTOT': {5697: 3, 5695: 4, 5694: 5, 5693: 6, 5692: 1, 5691: 3, 5696: 4, 5689: 4, 5688: 3, 5690: 3, 5684: 3, 5680: 4, 5681: 3, 5682: 4, 5686: 3, 5685: 0, 5687: 4, 5683: 2, 5678: 0, 5679: 4, 5677: 5, 5674: 0, 5676: 4, 5675: 0, 5673: 4, 5670: 2, 5672: 1, 5671: 0, 5665: 3, 5667: 1, 5666: 5, 5669: 0, 5664: 4, 5663: 0, 5662: 4, 5668: 3, 5661: 1, 5660: 3, 5654: 4, 5653: 2, 5655: 2, 5656: 2, 5657: 6, 5658: 4, 5659: 2, 5652: 5, 5651: 4, 5650: 4, 5649: 5, 5642: 3}}

df = pd.DataFrame.from_dict(d)

Where GTOT is the total number of goals in the match.

For every team in the competition I want to find out the number of consecutive matches where GTOT is minor 1 and add it to a list.

I use itertuples to iterate over the dataframe, but it's very slow for a large number of matches (> 50000) and teams (> 100). This is a semplified version of my function:

def calculates_overdue():
    lstTeams = [team for team in df.HomeTeam.unique()]
    lstTeams.sort()
    lstOverdue = []
    for team in lstTeams:
        overdue = 0
        mask = (df.HomeTeam == team) | (df.AwayTeam == team)
        for row in df[mask].itertuples():
            if row.GTOT < 1:
                lstOverdue.append((row.League, team, overdue, row.Date))
                overdue = 0
            else:
                overdue += 1
    return lstOverdue

print(calculates_overdue())

Result:

[('Premier League', 'Aston Villa', 4, '2003-08-24 15:00'), ('Premier League', 'Birmingham', 2, '2003-08-23 16:00'), ('Premier League', 'Bolton', 1, '2003-08-30 16:00'), ('Premier League', 'Charlton', 2, '2003-08-30 16:00'), ('Premier League', 'Leeds', 2, '2003-08-26 20:45'), ('Premier League', 'Leicester', 2, '2003-08-26 20:45'), ('Premier League', 'Liverpool', 2, '2003-08-27 21:00'), ('Premier League', 'Liverpool', 0, '2003-08-24 15:00'), ('Premier League', 'Middlesbrough', 2, '2003-08-26 20:45'), ('Premier League', 'Portsmouth', 1, '2003-08-30 16:00'), ('Premier League', 'Southampton', 2, '2003-08-26 20:45'), ('Premier League', 'Southampton', 0, '2003-08-23 16:00'), ('Premier League', 'Tottenham', 2, '2003-08-27 21:00'), ('Premier League', 'Wolves', 2, '2003-08-30 16:00')]

I know it's bad iterate over rows. How can I speed up these loops? Is it possible vectorize this function?

more importantly, you can provide some code to construct a sample dataframe, so that people can not just look the dataframe printout, but also run your code. stackoverflow.com/questions/20109391/… — Evgeny
– Evgeny, Commented Jun 15, 2018 at 13:01
@EvgenyPogrebnyak normally I would agree, but pandas has a nice builtin way to import dataframes from SO questions: df = pd.read_clipboard(sep="\s\s+") (I just had to adapt sep to not get fooled by single spaces in each cell) — filippo
– filippo, Commented Jun 15, 2018 at 14:16
@filippo: even though pd.read_clipboard is a nice feature, providing runnable code seems a responsiility of the OP, with a sample data included. you would still persist the dataframe in code (eg with .to_dict()), if you are editing/running it locally, to make the example reproducible. I would agree a printout is better than nothing, but it does not make a perfect question that is easy to deal with. — Evgeny
– Evgeny, Commented Jun 15, 2018 at 14:52
@EvgenyPogrebnyak completely agree! it's just I've seen a lot worse (screenshots?) here than I've grown to accept somewhat parsable printouts :-) — filippo
– filippo, Commented Jun 15, 2018 at 15:00

sonus21 · Accepted Answer · 2018-06-16 06:59:04Z

1

Use simple python call and remove some of pandas call.

def testTime():
        df = pd.DataFrame.from_dict(d)
        #print(df)
        #print(d.keys())
        lstOverdue = []
        lstTeams = [team for team in df.HomeTeam.unique()]
        for team in lstTeams:
                mask = (df.HomeTeam == team) | (df.AwayTeam == team)
                overdue = 0
                for row in df[mask].itertuples():
                        if row.GTOT < 1:
                                lstOverdue.append((row.League, team, overdue, row.Date))
                                overdue = 0
                        else:
                                overdue += 1
def testTime2():
        df = pd.DataFrame.from_dict(d)
        lstOverdue = []
        for team in df.HomeTeam.unique():
                overdue = 0
                for row in df.itertuples():
                        if row.HomeTeam == team or team == row.AwayTeam:
                                if row.GTOT < 1:
                                        lstOverdue.append((row.League, team, overdue, row.Date))
                                        overdue = 0
                                else:
                                        overdue += 1

if __name__ == '__main__':
        import timeit
        print(timeit.timeit("testTime2()", setup="from __main__ import testTime2", number=1000))
        print(timeit.timeit("testTime()", setup="from __main__ import testTime", number=1000))

see function testTime2, after running this code I observed around half of running time on my system with the data you've provided, but it can vary on your system as well.

$ python3.6 test.py
21.352469262998056
39.04840800600141
$ python3.6 test.py
22.977724283999123
39.568580347000534
$ python3.6 test.py
21.838805349998438
39.85868542999742

This can be further reduced to number of data rows but it requires some preprocessing as follows.

Convert dictionary to list of namedtuples during conversion create set of HomeTown.
Process the list similar to the function testTime2 process data frames.

edited Jun 16, 2018 at 6:59

answered Jun 15, 2018 at 18:36

sonus21

5,4082 gold badges27 silver badges52 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Evgeny Over a year ago

can you please replace the image with text?

sonus21 Over a year ago

@EvgenyPogrebnyak not sure why we have to do that, anyway changed to text.

Marco Montresor Over a year ago

Hmm...i try your funct, but without mask for df the execution time for DataFrame with 20000 rows increase x5.

Collectives™ on Stack Overflow

Python - speed up Pandas iteration

1 Answer 1

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related