
I have a pandas DataFrame with values in columns A, B, C, and D and want to determine, for every row, the first and last non-zero column. BUT the order of the elements is not the same for all rows; it is determined by the columns item_0, item_1 and item_2.

While I can easily do this by applying a function to every row, it becomes very slow for my DataFrame. Is there an elegant, more pythonic / pandas-native way to do this?

Input:

   A  B  C  D item_0 item_1 item_2
0  1  2  0  0      A      B      C
1  0  1  1  0      A      B      C
2  1  0  1  0      A      B      C
3  0  2  0  0      D      A      B
4  1  1  0  1      D      A      B
5  0  0  0  1      D      A      B

Expected Output:

   A  B  C  D item_0 item_1 item_2 first last
0  1  2  0  0      A      B      C     A    B
1  0  1  1  0      A      B      C     B    C
2  1  0  1  0      A      B      C     A    C
3  0  2  0  0      D      A      B     B    B
4  1  1  0  1      D      A      B     D    B
5  0  0  0  1      D      A      B     D    D

Update: Here's the current code with apply

import pandas as pd


def first_and_last_for_row(row):
    reference_list = row[["item_0", "item_1", "item_2"]].tolist()
    list_to_sort = (
        row[["A", "B", "C", "D"]].index[row[["A", "B", "C", "D"]] > 0].tolist()
    )
    ordered_list = [l for l in reference_list if l in list_to_sort]
    if len(ordered_list) == 0:
        return None, None
    else:
        return ordered_list[0], ordered_list[-1]


df = pd.DataFrame(
    {
        "A": [1, 0, 1, 0, 1, 0],
        "B": [2, 1, 0, 2, 1, 0],
        "C": [0, 1, 1, 0, 0, 0],
        "D": [0, 0, 0, 0, 1, 1],
        "item_0": ["A", "A", "A", "D", "D", "D"],
        "item_1": ["B", "B", "B", "A", "A", "A"],
        "item_2": ["C", "C", "C", "B", "B", "B"],
    }
)

df[["first", "last"]] = df.apply(first_and_last_for_row, axis=1, result_type="expand")
4 Comments

  • Could you share the code/function you are currently using row by row? Also, what does "slow" mean to you? Do you have time/memory constraints? How large is the DataFrame itself? Commented May 12, 2022 at 10:55
  • Is there always a non-zero value among the selected columns? What should happen otherwise? Commented May 12, 2022 at 11:18
  • @FBruzzesi I updated the post to include the current code. "Slow" means ~5 minutes for ~600,000 rows. I expect the number of rows to grow in the future. I don't have hard time constraints, but it is currently at the point where it is annoying and worth spending time to improve. Commented May 12, 2022 at 11:25
  • @mozway There can be (and are) all-zero rows. The first and last element can be considered NaN in that case. But ignoring that special case is fine since I can just filter the DataFrame accordingly. Commented May 12, 2022 at 11:27

5 Answers


Here is a fully vectorized NumPy approach. It's not very complex, but it has quite a few steps, so I've also provided a commented version of the code:

import numpy as np

cols = ['A', 'B', 'C', 'D']
a = df[cols].to_numpy()

idx = df.filter(like='item_').replace({k:v for v,k in enumerate(cols)}).to_numpy()
b = a[np.arange(len(a))[:,None], idx] != 0
first = b.argmax(1)
last = b.shape[1]-np.fliplr(b).argmax(1)-1

c = df.filter(like='item_').to_numpy()
df[['first', 'last']] = c[np.arange(len(c))[:,None],
                          np.vstack((first, last)).T]

mask = b[np.arange(len(b)), first]
df[['first', 'last']] = df[['first', 'last']].where(pd.Series(mask, index=df.index))

commented code:

import numpy as np

cols = ['A', 'B', 'C', 'D']

# convert to numpy array
a = df[cols].to_numpy()
# array([[1, 2, 0, 0],
#        [0, 1, 1, 0],
#        [1, 0, 1, 0],
#        [0, 2, 0, 0],
#        [1, 1, 0, 1],
#        [0, 0, 0, 1]])

# get indexer as numpy array
idx = df.filter(like='item_').replace({k:v for v,k in enumerate(cols)}).to_numpy()
# array([[0, 1, 2],
#        [0, 1, 2],
#        [0, 1, 2],
#        [3, 0, 1],
#        [3, 0, 1],
#        [3, 0, 1]])

# reorder columns and get non-zero
b = a[np.arange(len(a))[:,None], idx] != 0
# array([[ True,  True, False],
#        [False,  True,  True],
#        [ True, False,  True],
#        [False, False,  True],
#        [ True,  True,  True],
#        [ True, False, False]])

# first non-zero
first = b.argmax(1)
# array([0, 1, 0, 2, 0, 0])

# last non-zero
last = b.shape[1]-np.fliplr(b).argmax(1)-1
# array([1, 2, 2, 2, 2, 0])

# get back column names from position
c = df.filter(like='item_').to_numpy()
df[['first', 'last']] = c[np.arange(len(c))[:,None],
                          np.vstack((first, last)).T]

# optional
# define a mask in case a zero was selected
mask = b[np.arange(len(b)), first]
# array([ True,  True,  True,  True,  True,  True])
# mask where argmax was 0
df[['first', 'last']] = df[['first', 'last']].where(pd.Series(mask, index=df.index))

output:

   A  B  C  D item_0 item_1 item_2 first last
0  1  2  0  0      A      B      C     A    B
1  0  1  1  0      A      B      C     B    C
2  1  0  1  0      A      B      C     A    C
3  0  2  0  0      D      A      B     B    B
4  1  1  0  1      D      A      B     D    B
5  0  0  0  1      D      A      B     D    D
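The heavy lifting above is done by two NumPy idioms: per-row advanced indexing to reorder each row, and a flipped argmax to find the last match. A minimal standalone sketch of both (toy values, not the question's data):

```python
import numpy as np

a = np.array([[1, 2, 0],
              [0, 0, 5]])
idx = np.array([[2, 1, 0],   # reverse row 0
                [0, 2, 1]])  # custom order for row 1

# advanced indexing: for each row, pick the columns listed in idx
reordered = a[np.arange(len(a))[:, None], idx]
# -> [[0, 2, 1],
#     [0, 5, 0]]

b = reordered != 0
first = b.argmax(1)                              # first True per row -> [1, 1]
last = b.shape[1] - np.fliplr(b).argmax(1) - 1   # last True per row  -> [2, 1]
```

The row index `np.arange(len(a))[:, None]` broadcasts against `idx`, so each row of `idx` selects columns from the matching row of `a`.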



Let me start with a first attempt to "optimize", simply by avoiding the inner looping. The solution here is about 1.7x faster on 60k rows (I didn't have the patience to wait for 600k):

import numpy as np


def first_and_last(row):

    # select order given by the item columns
    i0, i1, i2 = items = np.array(row[["item_0", "item_1", "item_2"]])

    # select values in that order
    values = np.array(row[[i0, i1, i2]])

    pos_values = (values > 0)
    n_positives = np.sum(pos_values)  # count of positive entries

    if n_positives == 0:
        return np.nan, np.nan
    else:
        return items[pos_values][[0, -1]]

Then:

df_ = pd.concat([df]*10_000)

# Original function
%time df_.apply(first_and_last_for_row, axis=1, result_type="expand")
CPU times: user 53.3 s, sys: 22.5 ms, total: 53.4 s
Wall time: 53.4 s

# New function
%time df_.apply(first_and_last, axis=1, result_type="expand")
CPU times: user 32.9 s, sys: 0 ns, total: 32.9 s
Wall time: 32.9 s

However, the apply method is not optimal; there are other ways to iterate over a DataFrame. In particular, you can use the itertuples method:

def first_and_last_iter(row):

    # select order given by the item columns
    i0, i1, i2 = items = np.array([row.item_0, row.item_1, row.item_2])

    # select values in that order (column names are dynamic, hence getattr)
    values = np.array([getattr(row, i0), getattr(row, i1), getattr(row, i2)])

    pos_values = (values > 0)
    n_positives = np.sum(pos_values)  # count of positive entries

    if n_positives == 0:
        return np.nan, np.nan
    else:
        return items[pos_values][[0, -1]]

%time df_[["first", "last"]] = [first_and_last_iter(row) for row in df_.itertuples()]
CPU times: user 1.05 s, sys: 0 ns, total: 1.05 s
Wall time: 1.05 s

And that's a 50x improvement.
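Most of the gain comes from itertuples yielding lightweight namedtuples instead of building a pandas Series for every row, as apply does. A tiny illustration of the access pattern (df_demo is a made-up example):

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 2], "y": [10, 20]})

# each row arrives as a plain namedtuple; columns are attributes
sums = [row.x + row.y for row in df_demo.itertuples(index=False)]
# -> [11, 22]
```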

2 Comments

Can you add the timing for the other answers? From my quick test, mine runs in 80 ms for 60k rows, 600 ms for 600k rows, 15 s for 6M rows.
Fully vectorized (your answer) takes ~70 ms on 60k rows; really nice answer, of course it's the way to go! Good job!

Assuming your DataFrame is named df, here is something that works using filtering and no loops. It works with all-zero rows too (the value will be NaN in that case).

On my machine, it processes 10,000,000 rows in about 13 seconds.

# create boolean masks stating whether each column named in <item_n> is non-zero
# (note: DataFrame.lookup was removed in pandas 2.0)
i0 = df.lookup(df.index, df.item_0).astype(bool)  # [True, False, True, False, True, True]
i1 = df.lookup(df.index, df.item_1).astype(bool)
i2 = df.lookup(df.index, df.item_2).astype(bool)

# for the "first" column, fill with the value of item_0 where that column is non-zero
df['first'] = df.item_0[i0]  # ['A', NaN, 'A', NaN, 'D', 'D']
# fill the NaNs with values of item_1 where that column is non-zero
# (using .loc avoids chained-assignment warnings)
df.loc[~i0 & i1, 'first'] = df.item_1[~i0 & i1]
# fill the remaining NaNs with values of item_2 where that column is non-zero
df.loc[~i0 & ~i1 & i2, 'first'] = df.item_2[~i0 & ~i1 & i2]

# apply the same logic in reverse order for "last"
df['last'] = df.item_2[i2]
df.loc[~i2 & i1, 'last'] = df.item_1[~i2 & i1]
df.loc[~i2 & ~i1 & i0, 'last'] = df.item_0[~i2 & ~i1 & i0]

Output:

   A  B  C  D item_0 item_1 item_2 first last
0  1  2  0  0      A      B      C     A    B
1  0  1  1  0      A      B      C     B    C
2  1  0  1  0      A      B      C     A    C
3  0  2  0  0      D      A      B     B    B
4  1  1  0  1      D      A      B     D    B
5  0  0  0  1      D      A      B     D    D
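Since DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, the row-wise lookups can be reproduced with NumPy advanced indexing. A sketch on a two-row toy frame (not the full example above):

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    "A": [1, 0],
    "B": [2, 0],
    "item_0": ["A", "B"],
})
cols = ["A", "B"]

# map each row's item_0 label to its column position, then pick that
# row/column value from the underlying array
col_pos = df_toy["item_0"].map({c: i for i, c in enumerate(cols)}).to_numpy()
i0 = df_toy[cols].to_numpy()[np.arange(len(df_toy)), col_pos].astype(bool)
# -> [True, False]
```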


import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 0, 1, 0, 1, 0],
        "B": [2, 1, 0, 2, 1, 0],
        "C": [0, 1, 1, 0, 0, 0],
        "D": [0, 0, 0, 0, 1, 1],
        "item_0": ["A", "A", "A", "D", "D", "D"],
        "item_1": ["B", "B", "B", "B", "B", "B"],
        "item_2": ["C", "C", "C", "A", "A", "A"],
    }
)

first = []
last = []
for i in range(df.shape[0]):
    check1 = []
    for j in df.columns:
        t1 = df.loc[i, j]
        try:
            if t1 > 0:
                check1.append(j)
        except TypeError:  # item_* columns hold strings; skip them
            continue

    # note: these branches are hard-coded to this particular item order
    if len(check1) == 2:
        first.append(check1[0])
        last.append(check1[1])
    elif len(check1) == 3:
        first.append(check1[2])
        last.append(check1[1])
    elif len(check1) == 1:
        first.append(check1[0])
        last.append(check1[0])

df["first"] = first
df["last"] = last




Good question!

def function1(ss: pd.Series):
    # ss.iloc[4:] holds the row's item_0..item_2 labels; use them to
    # reindex the value columns into the row's own order
    ss1 = ss.loc[ss.iloc[4:].tolist()]
    # label of the first positive value in a series
    ld1 = lambda ss2: ss2.loc[lambda ss3: (ss3 > 0).cumsum() == 1].head(1).index.values[0]

    # scan forward for "first", backward for "last"
    return pd.Series([ld1(ss1), ld1(ss1[::-1])], index=['first', 'last'])


df.join(df.apply(function1, axis=1))


   A  B  C  D item_0 item_1 item_2 first last
0  1  2  0  0      A      B      C     A    B
1  0  1  1  0      A      B      C     B    C
2  1  0  1  0      A      B      C     A    C
3  0  2  0  0      D      A      B     B    B
4  1  1  0  1      D      A      B     D    B
5  0  0  0  1      D      A      B     D    D
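The trick inside ld1 is that (ss3 > 0).cumsum() == 1 is True from the first positive entry up to (but not including) the second one, so head(1) extracts exactly the first positive label. A standalone sketch with made-up values:

```python
import pandas as pd

s = pd.Series([0, 3, 0, 5], index=list("DABC"))
# (s > 0).cumsum() -> [0, 1, 1, 2]; == 1 selects labels 'A' and 'B';
# head(1) keeps the first of those, i.e. the first positive entry
first_label = s.loc[lambda x: (x > 0).cumsum() == 1].head(1).index[0]
# -> 'A'
```

Running ld1 over the reversed series then yields the last positive label the same way.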

1 Comment

@Spettekaka I think my answer is the best: elegant and clear :)
