
I have a pandas DataFrame with values in columns A, B, C, and D and want to determine, for every row, the first and last non-zero column. BUT the order of the elements is not the same for all rows; it is determined by the columns item_0, item_1 and item_2.

While I can easily do this by applying a function to every row, it becomes very slow for my DataFrame. Is there an elegant, more pythonic / pandas-native way to do this?

Input:

   A  B  C  D item_0 item_1 item_2
0  1  2  0  0      A      B      C
1  0  1  1  0      A      B      C
2  1  0  1  0      A      B      C
3  0  2  0  0      D      A      B
4  1  1  0  1      D      A      B
5  0  0  0  1      D      A      B

Expected Output:

   A  B  C  D item_0 item_1 item_2 first last
0  1  2  0  0      A      B      C     A    B
1  0  1  1  0      A      B      C     B    C
2  1  0  1  0      A      B      C     A    C
3  0  2  0  0      D      A      B     B    B
4  1  1  0  1      D      A      B     D    B
5  0  0  0  1      D      A      B     D    D

Update: Here's the current code with apply

import pandas as pd


def first_and_last_for_row(row):
    reference_list = row[["item_0", "item_1", "item_2"]].tolist()
    list_to_sort = (
        row[["A", "B", "C", "D"]].index[row[["A", "B", "C", "D"]] > 0].tolist()
    )
    ordered_list = [l for l in reference_list if l in list_to_sort]
    if len(ordered_list) == 0:
        return None, None
    else:
        return ordered_list[0], ordered_list[-1]


df = pd.DataFrame(
    {
        "A": [1, 0, 1, 0, 1, 0],
        "B": [2, 1, 0, 2, 1, 0],
        "C": [0, 1, 1, 0, 0, 0],
        "D": [0, 0, 0, 0, 1, 1],
        "item_0": ["A", "A", "A", "D", "D", "D"],
        "item_1": ["B", "B", "B", "A", "A", "A"],
        "item_2": ["C", "C", "C", "B", "B", "B"],
    }
)

df[["first", "last"]] = df.apply(first_and_last_for_row, axis=1, result_type="expand")
4 Comments

  • Could you share the code/function you are currently using row by row? Also, what does "slow" mean to you? Do you have time/memory constraints? How large is the DataFrame itself? Commented May 12, 2022 at 10:55
  • Is there always a non-zero value among the selected columns? What should happen otherwise? Commented May 12, 2022 at 11:18
  • @FBruzzesi I updated the post to include the current code. "Slow" means ~5 minutes for ~600,000 rows. I expect the number of rows to grow in the future. I don't have hard time constraints, but it is currently at the point where it is annoying and worth spending time to improve. Commented May 12, 2022 at 11:25
  • @mozway There can be (and are) all-zero rows. The first and last element can be considered NaN in that case. But ignoring that special case is fine since I can just filter the DataFrame accordingly. Commented May 12, 2022 at 11:27

5 Answers


Here is a fully vectorized NumPy approach. It's not very complex, but it has quite a few steps, so I've also provided a commented version of the code:

import numpy as np

cols = ['A', 'B', 'C', 'D']
a = df[cols].to_numpy()

idx = df.filter(like='item_').replace({k:v for v,k in enumerate(cols)}).to_numpy()
b = a[np.arange(len(a))[:,None], idx] != 0
first = b.argmax(1)
last = b.shape[1]-np.fliplr(b).argmax(1)-1

c = df.filter(like='item_').to_numpy()
df[['first', 'last']] = c[np.arange(len(c))[:,None],
                          np.vstack((first, last)).T]

mask = b[np.arange(len(b)), first]
df[['first', 'last']] = df[['first', 'last']].where(pd.Series(mask, index=df.index))

commented code:

import numpy as np

cols = ['A', 'B', 'C', 'D']

# convert to numpy array
a = df[cols].to_numpy()
# array([[1, 2, 0, 0],
#        [0, 1, 1, 0],
#        [1, 0, 1, 0],
#        [0, 2, 0, 0],
#        [1, 1, 0, 1],
#        [0, 0, 0, 1]])

# get indexer as numpy array
idx = df.filter(like='item_').replace({k:v for v,k in enumerate(cols)}).to_numpy()
# array([[0, 1, 2],
#        [0, 1, 2],
#        [0, 1, 2],
#        [3, 0, 1],
#        [3, 0, 1],
#        [3, 0, 1]])

# reorder columns and get non-zero
b = a[np.arange(len(a))[:,None], idx] != 0
# array([[ True,  True, False],
#        [False,  True,  True],
#        [ True, False,  True],
#        [False, False,  True],
#        [ True,  True,  True],
#        [ True, False, False]])

# first non-zero
first = b.argmax(1)
# array([0, 1, 0, 2, 0, 0])

# last non-zero
last = b.shape[1]-np.fliplr(b).argmax(1)-1
# array([1, 2, 2, 2, 2, 0])

# get back column names from position
c = df.filter(like='item_').to_numpy()
df[['first', 'last']] = c[np.arange(len(c))[:,None],
                          np.vstack((first, last)).T]

# optional
# define a mask in case a zero was selected
mask = b[np.arange(len(b)), first]
# array([ True,  True,  True,  True,  True,  True])
# mask where argmax was 0
df[['first', 'last']] = df[['first', 'last']].where(pd.Series(mask, index=df.index))

output:

   A  B  C  D item_0 item_1 item_2 first last
0  1  2  0  0      A      B      C     A    B
1  0  1  1  0      A      B      C     B    C
2  1  0  1  0      A      B      C     A    C
3  0  2  0  0      D      A      B     B    B
4  1  1  0  1      D      A      B     D    B
5  0  0  0  1      D      A      B     D    D
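The heavy lifting above is done by two NumPy idioms: per-row advanced indexing to reorder each row, and a flipped argmax to find the last match. A minimal standalone sketch of both (toy values, not the question's data):

```python
import numpy as np

a = np.array([[1, 2, 0],
              [0, 0, 5]])
idx = np.array([[2, 1, 0],   # reverse row 0
                [0, 2, 1]])  # custom order for row 1

# advanced indexing: for each row, pick the columns listed in idx
reordered = a[np.arange(len(a))[:, None], idx]
# -> [[0, 2, 1],
#     [0, 5, 0]]

b = reordered != 0
first = b.argmax(1)                              # first True per row -> [1, 1]
last = b.shape[1] - np.fliplr(b).argmax(1) - 1   # last True per row  -> [2, 1]
```

The row index `np.arange(len(a))[:, None]` broadcasts against `idx`, so each row of `idx` selects columns from the matching row of `a`.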



Let me start with a first attempt to "optimize", simply by avoiding the inner looping. The solution here is about 1.7x faster on 60k rows (I didn't have the patience to wait for 600k):

import numpy as np


def first_and_last(row):

    # select order given by the item columns
    i0, i1, i2 = items = np.array(row[["item_0", "item_1", "item_2"]])

    # select values in that order
    values = np.array(row[[i0, i1, i2]])

    pos_values = (values > 0)
    n_positives = np.sum(pos_values)  # count of positive entries

    if n_positives == 0:
        return np.nan, np.nan
    else:
        return items[pos_values][[0, -1]]

Then:

df_ = pd.concat([df]*10_000)

# Original function
%time df_.apply(first_and_last_for_row, axis=1, result_type="expand")
CPU times: user 53.3 s, sys: 22.5 ms, total: 53.4 s
Wall time: 53.4 s

# New function
%time df_.apply(first_and_last, axis=1, result_type="expand")
CPU times: user 32.9 s, sys: 0 ns, total: 32.9 s
Wall time: 32.9 s

However, the apply method is not optimal; there are other ways to iterate over a DataFrame. In particular, you can use the itertuples method:

def first_and_last_iter(row):

    # select order given by the item columns
    i0, i1, i2 = items = np.array([row.item_0, row.item_1, row.item_2])

    # select values in that order (column names are dynamic, hence getattr)
    values = np.array([getattr(row, i0), getattr(row, i1), getattr(row, i2)])

    pos_values = (values > 0)
    n_positives = np.sum(pos_values)  # count of positive entries

    if n_positives == 0:
        return np.nan, np.nan
    else:
        return items[pos_values][[0, -1]]

%time df_[["first", "last"]] = [first_and_last_iter(row) for row in df_.itertuples()]
CPU times: user 1.05 s, sys: 0 ns, total: 1.05 s
Wall time: 1.05 s

And that's a 50x improvement.
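Most of the gain comes from itertuples yielding lightweight namedtuples instead of building a pandas Series for every row, as apply does. A tiny illustration of the access pattern (df_demo is a made-up example):

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 2], "y": [10, 20]})

# each row arrives as a plain namedtuple; columns are attributes
sums = [row.x + row.y for row in df_demo.itertuples(index=False)]
# -> [11, 22]
```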

2 Comments

Can you add the timing for the other answers? From my quick test, mine runs in 80 ms for 60k rows, 600 ms for 600k rows, 15 s for 6M rows.
Fully vectorized (your answer) takes ~70 ms on 60k rows; really nice answer, of course it's the way to go! Good job!

Assuming your DataFrame is named df, here is something that works using filtering and no loops. It works with all-zero rows too (the value will be NaN in that case).

On my machine, it processes 10,000,000 rows in about 13 seconds.

# create boolean masks stating whether each column named in <item_n> is non-zero
# (note: DataFrame.lookup was removed in pandas 2.0)
i0 = df.lookup(df.index, df.item_0).astype(bool)  # [True, False, True, False, True, True]
i1 = df.lookup(df.index, df.item_1).astype(bool)
i2 = df.lookup(df.index, df.item_2).astype(bool)

# for the "first" column, fill with the value of item_0 where that column is non-zero
df['first'] = df.item_0[i0]  # ['A', NaN, 'A', NaN, 'D', 'D']
# fill the NaNs with values of item_1 where that column is non-zero
# (using .loc avoids chained-assignment warnings)
df.loc[~i0 & i1, 'first'] = df.item_1[~i0 & i1]
# fill the remaining NaNs with values of item_2 where that column is non-zero
df.loc[~i0 & ~i1 & i2, 'first'] = df.item_2[~i0 & ~i1 & i2]

# apply the same logic in reverse order for "last"
df['last'] = df.item_2[i2]
df.loc[~i2 & i1, 'last'] = df.item_1[~i2 & i1]
df.loc[~i2 & ~i1 & i0, 'last'] = df.item_0[~i2 & ~i1 & i0]

Output:

   A  B  C  D item_0 item_1 item_2 first last
0  1  2  0  0      A      B      C     A    B
1  0  1  1  0      A      B      C     B    C
2  1  0  1  0      A      B      C     A    C
3  0  2  0  0      D      A      B     B    B
4  1  1  0  1      D      A      B     D    B
5  0  0  0  1      D      A      B     D    D
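Since DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, the row-wise lookups can be reproduced with NumPy advanced indexing. A sketch on a two-row toy frame (not the full example above):

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    "A": [1, 0],
    "B": [2, 0],
    "item_0": ["A", "B"],
})
cols = ["A", "B"]

# map each row's item_0 label to its column position, then pick that
# row/column value from the underlying array
col_pos = df_toy["item_0"].map({c: i for i, c in enumerate(cols)}).to_numpy()
i0 = df_toy[cols].to_numpy()[np.arange(len(df_toy)), col_pos].astype(bool)
# -> [True, False]
```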


import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 0, 1, 0, 1, 0],
        "B": [2, 1, 0, 2, 1, 0],
        "C": [0, 1, 1, 0, 0, 0],
        "D": [0, 0, 0, 0, 1, 1],
        "item_0": ["A", "A", "A", "D", "D", "D"],
        "item_1": ["B", "B", "B", "B", "B", "B"],
        "item_2": ["C", "C", "C", "A", "A", "A"],
    }
)

first = []
last = []
for i in range(df.shape[0]):
    check1 = []
    for j in df.columns:
        t1 = df.loc[i, j]
        try:
            if t1 > 0:
                check1.append(j)
        except TypeError:  # item_* columns hold strings; skip them
            continue

    # note: these branches are hard-coded to this particular item order
    if len(check1) == 2:
        first.append(check1[0])
        last.append(check1[1])
    elif len(check1) == 3:
        first.append(check1[2])
        last.append(check1[1])
    elif len(check1) == 1:
        first.append(check1[0])
        last.append(check1[0])

df["first"] = first
df["last"] = last




Good question!

def function1(ss: pd.Series):
    # ss.iloc[4:] holds the row's item_0..item_2 labels; use them to
    # reindex the value columns into the row's own order
    ss1 = ss.loc[ss.iloc[4:].tolist()]
    # label of the first positive value in a series
    ld1 = lambda ss2: ss2.loc[lambda ss3: (ss3 > 0).cumsum() == 1].head(1).index.values[0]

    # scan forward for "first", backward for "last"
    return pd.Series([ld1(ss1), ld1(ss1[::-1])], index=['first', 'last'])


df.join(df.apply(function1, axis=1))


   A  B  C  D item_0 item_1 item_2 first last
0  1  2  0  0      A      B      C     A    B
1  0  1  1  0      A      B      C     B    C
2  1  0  1  0      A      B      C     A    C
3  0  2  0  0      D      A      B     B    B
4  1  1  0  1      D      A      B     D    B
5  0  0  0  1      D      A      B     D    D
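The trick inside ld1 is that (ss3 > 0).cumsum() == 1 is True from the first positive entry up to (but not including) the second one, so head(1) extracts exactly the first positive label. A standalone sketch with made-up values:

```python
import pandas as pd

s = pd.Series([0, 3, 0, 5], index=list("DABC"))
# (s > 0).cumsum() -> [0, 1, 1, 2]; == 1 selects labels 'A' and 'B';
# head(1) keeps the first of those, i.e. the first positive entry
first_label = s.loc[lambda x: (x > 0).cumsum() == 1].head(1).index[0]
# -> 'A'
```

Running ld1 over the reversed series then yields the last positive label the same way.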

1 Comment

@Spettekaka I think my answer is the best: elegant and clear :)
