
Here is the simplified dataset:

   Character    x0    x1
0          T   0.0   1.0
1          h   1.1   2.1
2          i   2.2   3.2
3          s   3.3   4.3
5          i   5.5   6.5
6          s   6.6   7.6
8          a   8.8   9.8
10         s  11.0  12.0
11         a  12.1  13.1
12         m  13.2  14.2
13         p  14.3  15.3
14         l  15.4  16.4
15         e  16.5  17.5
16         .  17.6  18.6

The simplified dataset is generated by the following code:

import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    ch.append(s)
    x0.append(round(x1[-1]+0.1,1))
    x1.append(round(x0[-1]+1,1))

df = pd.DataFrame(list(zip(ch, x0, x1)), columns = ['Character', 'x0', 'x1'])
df = df.drop(df.loc[df['Character'] == ' '].index)

x0 and x1 represent the starting and ending positions of each Character, respectively. Assume that the gap between any two adjacent characters equals 0.1. In other words, if the difference between the x0 of a character and the x1 of the previous character is 0.1, the two characters belong to the same string. If the difference is larger than 0.1, the character starts a new string. I need to produce a dataframe of strings and their respective x0 and x1, which I currently do by looping through the dataframe with .iterrows():

string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    else:
        if round(row['x0']-x1[-1],1) == 0.1:
            string[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            string.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns = ['String', 'x0', 'x1'])

Here is the result:

    String    x0    x1
0     This   0.0   4.3
1       is   5.5   7.6
2        a   8.8   9.8
3  sample.  11.0  18.6

Is there a faster way to achieve this?

1 Answer

You could use groupby + agg:

# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.at[0, 'x0'])).abs()

# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()

# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
print(res)

Output

  Character    x0    x1
0      This   0.0   4.3
1        is   5.5   7.6
2         a   8.8   9.8
3   sample.  11.0  18.6

The tricky part is this one:

# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()

The idea is to convert the column of diffs (same) into a boolean column in which each True marks the start of a new group. cumsum then increments at every True, so all rows between two consecutive True values share the same group id.
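To make the two steps concrete, here is a minimal sketch using the gaps from the first few rows of the sample data. The variable names (gaps, starts_new, group_id, naive) are illustrative and not part of the answer's code. It shows why a plain comparison against 0.1 is unreliable and how cumsum turns the flags into group ids.

```python
import pandas as pd

# Floating point: the gap between adjacent characters is not exactly 0.1.
print(1.1 - 1.0 == 0.1)  # False: the difference is 0.10000000000000009

# Gaps between each character's x0 and the previous character's x1
# for "This is a" (the first row has no predecessor, so its gap is 0).
gaps = pd.Series([0.0, 1.1 - 1.0, 2.2 - 2.1, 3.3 - 3.2, 5.5 - 4.3, 6.6 - 6.5, 8.8 - 7.6])

# True wherever the gap is clearly larger than 0.1, i.e. a new string starts.
starts_new = (gaps - 0.1) > 0.00001

# Each True increments the running total, so rows share an id until the next True.
group_id = starts_new.cumsum()
print(group_id.tolist())  # [0, 0, 0, 0, 1, 1, 2]

# Without the tolerance, floating-point noise splits "This" after its first character.
naive = (gaps > 0.1).cumsum()
print(naive.tolist())
```

The tolerance of 0.00001 is far smaller than the real gaps between words (1.2 here) and far larger than the rounding error, so any value in between would work for this data.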

As suggested by @ShubhamSharma, you could do:

# flag rows whose gap to the previous character exceeds 0.1 (rounded to avoid floating-point noise)
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)

# create grouper column: each True starts a new group id
grouper = same.cumsum()

The other part remains the same.
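For reference, a self-contained sketch that combines the rounding variant with the unchanged groupby + agg step might look like the following. It rebuilds the sample data from the question; the intermediate name gap is an assumption introduced here for readability.

```python
import pandas as pd

# Rebuild the sample data from the question.
ch, x0, x1 = ['T'], [0.0], [1.0]
for s in 'his is a sample.':
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))
    x1.append(round(x0[-1] + 1, 1))
df = pd.DataFrame({'Character': ch, 'x0': x0, 'x1': x1})
df = df[df['Character'] != ' ']

# Gap to the previous character; the first row is compared with itself (gap 0).
gap = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3)

# A gap above 0.1 starts a new group; cumsum turns the flags into group ids.
grouper = gap.gt(0.1).cumsum()

# Join the characters and keep the first x0 and last x1 of each group.
res = df.groupby(grouper).agg({'Character': ''.join, 'x0': 'first', 'x1': 'last'})
print(res)
```

Rounding to three decimals removes the floating-point noise before the comparison, so no explicit tolerance is needed.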


Comments

Nice answer, maybe you can round the values up to a fixed precision after subtracting, like (df['x0'] - df['x1'].shift().fillna(df['x0'])).round(3).gt(.1)
@ShubhamSharma Included your suggestion, thanks!
Nice one and thank you both. However it seems that iterrows() still runs faster (1.55 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)) while groupby + agg gives (3 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)). Is there any faster alternative?
Hi @IvanC I'm really surprised that iterrows is faster, it's generally very slow. On what data did you test that?
Hi @DaniMesejo, you are right! When I increase the dataset to around 17,000 rows, the iterrows runs 2.1 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) while groupby + agg gives (62.9 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)).
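A comparison like the one in these comments can be reproduced with a sketch along the following lines. The helper names (make_data, group_iterrows, group_groupby) and the dataset size are assumptions made for illustration, not the code the commenters actually ran, and the timings will vary by machine.

```python
import timeit
import pandas as pd

def make_data(n_repeats):
    """Hypothetical helper: repeat the sample sentence to get a larger frame."""
    text = ' '.join(['This is a sample.'] * n_repeats)
    ch, x0, x1 = [], [], []
    for s in text:
        start = 0.0 if not x1 else round(x1[-1] + 0.1, 1)
        ch.append(s)
        x0.append(start)
        x1.append(round(start + 1, 1))
    df = pd.DataFrame({'Character': ch, 'x0': x0, 'x1': x1})
    return df[df['Character'] != ' ']

def group_iterrows(df):
    """Row-by-row loop, following the logic of the question."""
    string, x0, x1 = [], [], []
    for _, row in df.iterrows():
        if string and round(row['x0'] - x1[-1], 1) == 0.1:
            string[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            string.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
    return pd.DataFrame({'String': string, 'x0': x0, 'x1': x1})

def group_groupby(df):
    """Vectorised version, following the rounding variant of the answer."""
    gap = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3)
    res = df.groupby(gap.gt(0.1).cumsum()).agg(
        {'Character': ''.join, 'x0': 'first', 'x1': 'last'})
    return res.rename(columns={'Character': 'String'}).reset_index(drop=True)

big = make_data(200)  # 200 sentences, 2,800 non-space characters
slow, fast = group_iterrows(big), group_groupby(big)

# Both approaches should agree before their timings are compared.
for col in ['String', 'x0', 'x1']:
    assert slow[col].tolist() == fast[col].tolist()

print('iterrows:', timeit.timeit(lambda: group_iterrows(big), number=3) / 3, 's per run')
print('groupby :', timeit.timeit(lambda: group_groupby(big), number=3) / 3, 's per run')
```

Increasing the argument to make_data makes the gap between the two approaches more visible, since the loop's cost grows with every row while the vectorised version has mostly fixed overhead.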