1

Sorry for my bad english.

This is a simplified version of my dataframe:

d = {'League': {5697: 'Premier League', 5695: 'Premier League', 5694: 'Premier League', 5693: 'Premier League', 5692: 'Premier League', 5691: 'Premier League', 5696: 'Premier League', 5689: 'Premier League', 5688: 'Premier League', 5690: 'Premier League', 5684: 'Premier League', 5680: 'Premier League', 5681: 'Premier League', 5682: 'Premier League', 5686: 'Premier League', 5685: 'Premier League', 5687: 'Premier League', 5683: 'Premier League', 5678: 'Premier League', 5679: 'Premier League', 5677: 'Premier League', 5674: 'Premier League', 5676: 'Premier League', 5675: 'Premier League', 5673: 'Premier League', 5670: 'Premier League', 5672: 'Premier League', 5671: 'Premier League', 5665: 'Premier League', 5667: 'Premier League', 5666: 'Premier League', 5669: 'Premier League', 5664: 'Premier League', 5663: 'Premier League', 5662: 'Premier League', 5668: 'Premier League', 5661: 'Premier League', 5660: 'Premier League', 5654: 'Premier League', 5653: 'Premier League', 5655: 'Premier League', 5656: 'Premier League', 5657: 'Premier League', 5658: 'Premier League', 5659: 'Premier League', 5652: 'Premier League', 5651: 'Premier League', 5650: 'Premier League', 5649: 'Premier League', 5642: 'Premier League'}, 'Date': {5697: '2003-08-16 13:30', 5695: '2003-08-16 16:00', 5694: '2003-08-16 16:00', 5693: '2003-08-16 16:00', 5692: '2003-08-16 16:00', 5691: '2003-08-16 16:00', 5696: '2003-08-16 16:00', 5689: '2003-08-17 15:00', 5688: '2003-08-17 15:00', 5690: '2003-08-17 17:05', 5684: '2003-08-23 13:00', 5680: '2003-08-23 16:00', 5681: '2003-08-23 16:00', 5682: '2003-08-23 16:00', 5686: '2003-08-23 16:00', 5685: '2003-08-23 16:00', 5687: '2003-08-23 16:00', 5683: '2003-08-23 16:00', 5678: '2003-08-24 15:00', 5679: '2003-08-24 17:05', 5677: '2003-08-25 21:00', 5674: '2003-08-26 20:45', 5676: '2003-08-26 20:45', 5675: '2003-08-26 20:45', 5673: '2003-08-26 21:00', 5670: '2003-08-27 20:45', 5672: '2003-08-27 21:00', 5671: '2003-08-27 21:00', 5665: '2003-08-30 13:30', 5667: '2003-08-30 16:00', 5666: '2003-08-30 16:00', 5669: '2003-08-30 16:00', 5664: '2003-08-30 16:00', 5663: '2003-08-30 16:00', 5662: '2003-08-30 16:00', 5668: '2003-08-30 16:00', 5661: '2003-08-31 15:00', 5660: '2003-08-31 17:05', 5654: '2003-09-13 16:00', 5653: '2003-09-13 16:00', 5655: '2003-09-13 16:00', 5656: '2003-09-13 16:00', 5657: '2003-09-13 16:00', 5658: '2003-09-13 16:00', 5659: '2003-09-13 16:00', 5652: '2003-09-14 15:00', 5651: '2003-09-14 17:05', 5650: '2003-09-15 21:00', 5649: '2003-09-20 13:30', 5642: '2003-09-20 16:00'}, 'HomeTeam': {5697: 'Portsmouth', 5695: 'Leicester', 5694: 'Fulham', 5693: 'Blackburn', 5692: 'Birmingham', 5691: 'Arsenal', 5696: 'Manchester United', 5689: 'Leeds', 5688: 'Charlton', 5690: 'Liverpool', 5684: 'Newcastle', 5680: 'Bolton', 5681: 'Chelsea', 5682: 'Everton', 5686: 'Tottenham', 5685: 'Southampton', 5687: 'Wolves', 5683: 'Manchester City', 5678: 'Aston Villa', 5679: 'Middlesbrough', 5677: 'Blackburn', 5674: 'Leeds', 5676: 'Portsmouth', 5675: 'Leicester', 5673: 'Charlton', 5670: 'Arsenal', 5672: 'Manchester United', 5671: 'Liverpool', 5665: 'Everton', 5667: 'Newcastle', 5666: 'Middlesbrough', 5669: 'Wolves', 5664: 'Chelsea', 5663: 'Bolton', 5662: 'Aston Villa', 5668: 'Tottenham', 5661: 'Southampton', 5660: 'Manchester City', 5654: 'Blackburn', 5653: 'Arsenal', 5655: 'Bolton', 5656: 'Charlton', 5657: 'Chelsea', 5658: 'Everton', 5659: 'Southampton', 5652: 'Manchester City', 5651: 'Birmingham', 5650: 'Leicester', 5649: 'Wolves', 5642: 'Aston Villa'}, 'AwayTeam': {5697: 'Aston Villa', 5695: 'Southampton', 5694: 'Middlesbrough', 5693: 'Wolves', 5692: 'Tottenham', 5691: 'Everton', 5696: 'Bolton', 5689: 'Newcastle', 5688: 'Manchester City', 5690: 'Chelsea', 5684: 'Manchester United', 5680: 'Blackburn', 5681: 'Leicester', 5682: 'Fulham', 5686: 'Leeds', 5685: 'Birmingham', 5687: 'Charlton', 5683: 'Portsmouth', 5678: 'Liverpool', 5679: 'Arsenal', 5677: 'Manchester City', 5674: 'Southampton', 5676: 'Bolton', 5675: 'Middlesbrough', 5673: 'Everton', 5670: 'Aston Villa', 5672: 'Wolves', 5671: 'Tottenham', 5665: 'Liverpool', 5667: 'Birmingham', 5666: 'Leeds', 5669: 'Portsmouth', 5664: 'Blackburn', 5663: 'Charlton', 5662: 'Leicester', 5668: 'Fulham', 5661: 'Manchester United', 5660: 'Arsenal', 5654: 'Liverpool', 5653: 'Portsmouth', 5655: 'Middlesbrough', 5656: 'Manchester United', 5657: 'Tottenham', 5658: 'Newcastle', 5659: 'Wolves', 5652: 'Aston Villa', 5651: 'Fulham', 5650: 'Leeds', 5649: 'Chelsea', 5642: 'Charlton'}, 'GTOT': {5697: 3, 5695: 4, 5694: 5, 5693: 6, 5692: 1, 5691: 3, 5696: 4, 5689: 4, 5688: 3, 5690: 3, 5684: 3, 5680: 4, 5681: 3, 5682: 4, 5686: 3, 5685: 0, 5687: 4, 5683: 2, 5678: 0, 5679: 4, 5677: 5, 5674: 0, 5676: 4, 5675: 0, 5673: 4, 5670: 2, 5672: 1, 5671: 0, 5665: 3, 5667: 1, 5666: 5, 5669: 0, 5664: 4, 5663: 0, 5662: 4, 5668: 3, 5661: 1, 5660: 3, 5654: 4, 5653: 2, 5655: 2, 5656: 2, 5657: 6, 5658: 4, 5659: 2, 5652: 5, 5651: 4, 5650: 4, 5649: 5, 5642: 3}}

df = pd.DataFrame.from_dict(d)

Where GTOT is the total number of goals in the match.

For every team in the competition I want to find out the number of consecutive matches where GTOT is minor 1 and add it to a list.

I use itertuples to iterate over the dataframe, but it's very slow for a large number of matches (> 50000) and teams (> 100). This is a semplified version of my function:

def calculates_overdue():
    lstTeams = [team for team in df.HomeTeam.unique()]
    lstTeams.sort()
    lstOverdue = []
    for team in lstTeams:
        overdue = 0
        mask = (df.HomeTeam == team) | (df.AwayTeam == team)
        for row in df[mask].itertuples():
            if row.GTOT < 1:
                lstOverdue.append((row.League, team, overdue, row.Date))
                overdue = 0
            else:
                overdue += 1
    return lstOverdue

print(calculates_overdue())

Result:

[('Premier League', 'Aston Villa', 4, '2003-08-24 15:00'), ('Premier League', 'Birmingham', 2, '2003-08-23 16:00'), ('Premier League', 'Bolton', 1, '2003-08-30 16:00'), ('Premier League', 'Charlton', 2, '2003-08-30 16:00'), ('Premier League', 'Leeds', 2, '2003-08-26 20:45'), ('Premier League', 'Leicester', 2, '2003-08-26 20:45'), ('Premier League', 'Liverpool', 2, '2003-08-27 21:00'), ('Premier League', 'Liverpool', 0, '2003-08-24 15:00'), ('Premier League', 'Middlesbrough', 2, '2003-08-26 20:45'), ('Premier League', 'Portsmouth', 1, '2003-08-30 16:00'), ('Premier League', 'Southampton', 2, '2003-08-26 20:45'), ('Premier League', 'Southampton', 0, '2003-08-23 16:00'), ('Premier League', 'Tottenham', 2, '2003-08-27 21:00'), ('Premier League', 'Wolves', 2, '2003-08-30 16:00')]

I know it's bad iterate over rows. How can I speed up these loops? Is it possible vectorize this function?

8
  • 1
    You should post on code review rather than on SO. Commented Jun 15, 2018 at 12:05
  • 2
    more importantly, you can provide some code to construct a sample dataframe, so that people can not just look the dataframe printout, but also run your code. stackoverflow.com/questions/20109391/… Commented Jun 15, 2018 at 13:01
  • @EvgenyPogrebnyak normally I would agree, but pandas has a nice builtin way to import dataframes from SO questions: df = pd.read_clipboard(sep="\s\s+") (I just had to adapt sep to not get fooled by single spaces in each cell) Commented Jun 15, 2018 at 14:16
  • 1
    @filippo: even though pd.read_clipboard is a nice feature, providing runnable code seems a responsiility of the OP, with a sample data included. you would still persist the dataframe in code (eg with .to_dict()), if you are editing/running it locally, to make the example reproducible. I would agree a printout is better than nothing, but it does not make a perfect question that is easy to deal with. Commented Jun 15, 2018 at 14:52
  • 1
    @EvgenyPogrebnyak completely agree! it's just I've seen a lot worse (screenshots?) here than I've grown to accept somewhat parsable printouts :-) Commented Jun 15, 2018 at 15:00

1 Answer 1

1

Use simple python call and remove some of pandas call.

def testTime():
        df = pd.DataFrame.from_dict(d)
        #print(df)
        #print(d.keys())
        lstOverdue = []
        lstTeams = [team for team in df.HomeTeam.unique()]
        for team in lstTeams:
                mask = (df.HomeTeam == team) | (df.AwayTeam == team)
                overdue = 0
                for row in df[mask].itertuples():
                        if row.GTOT < 1:
                                lstOverdue.append((row.League, team, overdue, row.Date))
                                overdue = 0
                        else:
                                overdue += 1
def testTime2():
        df = pd.DataFrame.from_dict(d)
        lstOverdue = []
        for team in df.HomeTeam.unique():
                overdue = 0
                for row in df.itertuples():
                        if row.HomeTeam == team or team == row.AwayTeam:
                                if row.GTOT < 1:
                                        lstOverdue.append((row.League, team, overdue, row.Date))
                                        overdue = 0
                                else:
                                        overdue += 1

if __name__ == '__main__':
        import timeit
        print(timeit.timeit("testTime2()", setup="from __main__ import testTime2", number=1000))
        print(timeit.timeit("testTime()", setup="from __main__ import testTime", number=1000))

see function testTime2, after running this code I observed around half of running time on my system with the data you've provided, but it can vary on your system as well.

$ python3.6 test.py
21.352469262998056
39.04840800600141
$ python3.6 test.py
22.977724283999123
39.568580347000534
$ python3.6 test.py
21.838805349998438
39.85868542999742

This can be further reduced to number of data rows but it requires some preprocessing as follows.

  1. Convert dictionary to list of namedtuples during conversion create set of HomeTown.

  2. Process the list similar to the function testTime2 process data frames.

Sign up to request clarification or add additional context in comments.

3 Comments

can you please replace the image with text?
@EvgenyPogrebnyak not sure why we have to do that, anyway changed to text.
Hmm...i try your funct, but without mask for df the execution time for DataFrame with 20000 rows increase x5.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.