Python - Looping through dataframe using methods other than .iterrows()

Question

Here is the simplified dataset:

   Character    x0    x1
0          T   0.0   1.0
1          h   1.1   2.1
2          i   2.2   3.2
3          s   3.3   4.3
5          i   5.5   6.5
6          s   6.6   7.6
8          a   8.8   9.8
10         s  11.0  12.0
11         a  12.1  13.1
12         m  13.2  14.2
13         p  14.3  15.3
14         l  15.4  16.4
15         e  16.5  17.5
16         .  17.6  18.6

The simplified dataset is generated by the following code:

import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    ch.append(s)
    x0.append(round(x1[-1]+0.1,1))
    x1.append(round(x0[-1]+1,1))

df = pd.DataFrame(list(zip(ch, x0, x1)), columns = ['Character', 'x0', 'x1'])
df = df.drop(df.loc[df['Character'] == ' '].index)

x0 and x1 represent the starting and ending positions of each Character, respectively. Assume that the distance between any two adjacent characters equals 0.1. In other words, if the difference between the x0 of a character and the x1 of the previous character is 0.1, the two characters belong to the same string. If the difference is larger than 0.1, the character starts a new string. I need to produce a dataframe of strings and their respective x0 and x1, which I currently do by looping through the dataframe using .iterrows():

string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    else:
        if round(row['x0']-x1[-1],1) == 0.1:
            string[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            string.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns = ['String', 'x0', 'x1'])

Here is the result:

    String    x0    x1
0     This   0.0   4.3
1       is   5.5   7.6
2        a   8.8   9.8
3  sample.  11.0  18.6

Is there a faster way to achieve this?

Dani Mesejo · Accepted Answer · 2020-12-28 16:26:34Z


You could use groupby + agg:

# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.at[0, 'x0'])).abs()

# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()

# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
print(res)

Output

  Character    x0    x1
0      This   0.0   4.3
1        is   5.5   7.6
2         a   8.8   9.8
3   sample.  11.0  18.6

The tricky part is this one:

# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()

The idea is to convert the column of diffs (same) into a True/False column in which every True marks the start of a new group. The cumsum then turns those flags into a running counter, so every row in the same group gets the same id.
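To make the cumsum step concrete, here is a minimal sketch on a made-up boolean Series (the flag values are hypothetical and are not taken from the question's data):

```python
import pandas as pd

# Hypothetical gap flags: True marks a row that starts a new string.
starts_new = pd.Series([False, False, True, False, True, False, False])

# cumsum counts True as 1 and False as 0, so the running total goes up
# by one at every True and stays flat otherwise.
group_id = starts_new.cumsum()
print(group_id.tolist())  # [0, 0, 1, 1, 2, 2, 2]
```

Rows that share a value in group_id end up in the same group when that Series is passed to groupby.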

As suggested by @ShubhamSharma, you could do:

# flag rows whose gap to the previous character exceeds 0.1
# (rounded first to avoid floating-point noise)
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)

# create grouper column
grouper = same.cumsum()

The other part remains the same.
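For reference, the pieces can be put together roughly as follows. This is a sketch under a few assumptions: it rebuilds the question's sample data, uses the rounding variant, and renames the Character column to String to match the layout asked for in the question (the rename is an addition, not part of the answer above).

```python
import pandas as pd

# Rebuild the sample data from the question.
ch, x0, x1 = ['T'], [0.0], [1.0]
for s in 'his is a sample.':
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))
    x1.append(round(x0[-1] + 1, 1))
df = pd.DataFrame({'Character': ch, 'x0': x0, 'x1': x1})
df = df[df['Character'] != ' ']

# Plain float arithmetic is inexact, which is why the gap is rounded
# before it is compared against 0.1.
print(1.1 - 1.0 == 0.1)  # False

# Gap to the previous character; the first row has no predecessor,
# so fillna makes its gap 0.
gap = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs()
grouper = gap.round(3).gt(0.1).cumsum()

res = (df.groupby(grouper)
         .agg({'Character': ''.join, 'x0': 'first', 'x1': 'last'})
         .rename(columns={'Character': 'String'}))
print(res)
```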

edited Dec 28, 2020 at 16:26

answered Dec 28, 2020 at 15:58


8 Comments

Shubham Sharma Over a year ago

Nice answer, maybe you can round the values up to a fixed precision after subtracting, like (df['x0'] - df['x1'].shift().fillna(df['x0'])).round(3).gt(.1)

Dani Mesejo Over a year ago

@ShubhamSharma Included your suggestion, thanks!

i.c Over a year ago

Nice one and thank you both. However it seems that iterrows() still runs faster (1.55 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)) while groupby + agg gives (3 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)). Is there any faster alternative?

Dani Mesejo Over a year ago

Hi @IvanC, I'm really surprised that iterrows is faster; it's generally very slow. What data did you test that on?

i.c Over a year ago

Hi @DaniMesejo, you are right! When I increase the dataset to around 17,000 rows, iterrows takes 2.1 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) while groupby + agg takes 62.9 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each).
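A comparison like the one in these comments could be reproduced with a rough sketch along the following lines. The helper names (make_frame, with_iterrows, with_groupby) and the synthetic text are assumptions made for illustration, and the actual timings will depend on the machine and the pandas version.

```python
import timeit
import pandas as pd

def make_frame(text):
    """Build a character frame like the question's, dropping spaces."""
    ch, x0, x1 = [], [], []
    pos = 0.0
    for c in text:
        ch.append(c)
        x0.append(round(pos, 1))
        x1.append(round(pos + 1, 1))
        pos = x1[-1] + 0.1
    df = pd.DataFrame({'Character': ch, 'x0': x0, 'x1': x1})
    return df[df['Character'] != ' '].reset_index(drop=True)

def with_iterrows(df):
    """Row-by-row loop, following the question's approach."""
    strings, x0, x1 = [], [], []
    for _, row in df.iterrows():
        if strings and round(row['x0'] - x1[-1], 1) == 0.1:
            strings[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            strings.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
    return pd.DataFrame({'String': strings, 'x0': x0, 'x1': x1})

def with_groupby(df):
    """Vectorised version, following the groupby + agg answer."""
    gap = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs()
    grouper = gap.round(3).gt(0.1).cumsum()
    res = df.groupby(grouper).agg(
        {'Character': ''.join, 'x0': 'first', 'x1': 'last'})
    return res.rename(columns={'Character': 'String'}).reset_index(drop=True)

# Synthetic input: the sample sentence repeated 200 times.
big = make_frame('This is a sample. ' * 200)

# Sanity check that both approaches find the same strings.
assert with_iterrows(big)['String'].tolist() == with_groupby(big)['String'].tolist()

for fn in (with_iterrows, with_groupby):
    seconds = timeit.timeit(lambda: fn(big), number=3) / 3
    print(f'{fn.__name__}: {seconds * 1e3:.1f} ms per call')
```

Increasing the repeat count makes the gap between the two approaches easier to see, in line with what the comments report for small versus large inputs.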
