
I have a dataframe that looks like this:

>>> df = pd.DataFrame( {'InLevel_03': [12, 12, 13, 12, 11,], 'InLevel_02': [11.5, 11.5, 12.5, 11.5, 10.5], 'InLevel_01': [11, 10.5, 12, 10.5, 9], 'OutLevel_01': [10.5, 10, 11.5, 10, 8.5], 'OutLevel_02': [10, 9.5, 11, 9.5, 8], 'OutLevel_03': [9.5, 9, 10, 9, 7.5]} )

>>> df
   InLevel_03  InLevel_02  InLevel_01  OutLevel_01  OutLevel_02  OutLevel_03
0          12        11.5        11.0         10.5         10.0          9.5
1          12        11.5        10.5         10.0          9.5          9.0
2          13        12.5        12.0         11.5         11.0         10.0
3          12        11.5        10.5         10.0          9.5          9.0
4          11        10.5         9.0          8.5          8.0          7.5

Given a threshold value of 0.5, I want to check whether any row contains a gap bigger than that value between adjacent columns. For example, in the 2nd row there's a gap between InLevel_02 (11.5) and InLevel_01 (10.5): the missing level is 11. In the 5th row, the missing levels are 10 and 9.5, between InLevel_02 (10.5) and InLevel_01 (9.0).

The result of this job would look like this:

 gapLevel    count    # row number, column name of each gap
       11        2    # (1, InLevel_02 - 1, InLevel_01), (3, InLevel_02 - 3, InLevel_01)
     10.5        1    # (2, OutLevel_02 - 2, OutLevel_03)
       10        1    # (4, InLevel_02 - 4, InLevel_01)
      9.5        1    # (4, InLevel_02 - 4, InLevel_01)

I tried converting the dataframe into an array (using .to_records) and comparing each value with the next one in a loop, but the code gets too complicated when more than one level is missing between two values, and I'd like to know whether there's a more efficient way to do this.
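To make the gap definition concrete, here is a minimal sketch (using the 5th-row values from the example above) of how the missing levels between two adjacent cells can be enumerated:

```python
# Row 4 (0-indexed): InLevel_02 = 10.5, InLevel_01 = 9.0.
# The drop of 1.5 exceeds the step of 0.5, so two levels are missing.
a, b, step = 10.5, 9.0, 0.5

# number of levels missing strictly between b and a
n_missing = int(round((a - b) / step)) - 1

missing = [round(b + step * i, 1) for i in range(1, n_missing + 1)]
print(missing)  # [9.5, 10.0]
```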

  • Understood what needs to be done... but can you explain more about your output df? How is gapLevel matched with the values in the input df? Commented Jan 7, 2019 at 10:17
  • Also, do you want only these two columns in your output df, i.e. gapLevel and count? You don't want to merge these columns into the original `df`? Commented Jan 7, 2019 at 10:18
  • @Rahul Agarwal Thank you for the reply. I added how the result was made. I just counted the missing gaps per level in the original dataframe. I think the result should go into a new dataframe due to its different shape. Commented Jan 7, 2019 at 10:26
  • So the gaps are between values within a row; are the gaps between any pair of values within that row, or only pairs which sit next to each other? Commented Jan 7, 2019 at 10:28
  • @cardamom I'm sorry for the lack of clarity in my question. I mean pairs which sit next to each other. You can see that the levels in the first row decrease by 0.5 (12, 11.5, 11.0, 10.5, 10.0, 9.5). If the step is not 0.5 but 1 (1 gap exists) or 1.5 (2 gaps exist), it's not continuous and there's a gap. Commented Jan 7, 2019 at 10:35

1 Answer

Here's one approach:

You can begin by obtaining the row and column indices from which to extract the counts, by checking where the dataframe minus a shifted version of itself (see pd.shift) is greater than 0.5:

import numpy as np

t = 0.5
# df = df.astype(float)  # if it isn't already
rows, cols = np.where(df - df.shift(-1, axis=1) > t)
# (array([1, 2, 3, 4]), array([1, 4, 1, 1]))

Then build the range of missing values for each of these row/column pairs using a list comprehension (note that this approach assumes the values keep decreasing across the columns):

v = [np.arange(*df.iloc[r,[c+1, c]].values, step=t)[1:] for r, c in zip(rows, cols)]
# [array([11.]), array([10.5]), array([11.]), array([ 9.5, 10. ])]

Create a new Series from the counts using Counter:

from itertools import chain
from collections import Counter

x = list(chain.from_iterable(v))  # v is a plain list, so no .values here
# [11.0, 10.5, 11.0, 9.5, 10.0]
pd.Series(Counter(x), name = 'count')

11.0    2
10.5    1
9.5     1
10.0    1
Name: count, dtype: int64
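For reference, the steps above combined into a single runnable script (rebuilding the question's dataframe):

```python
import numpy as np
import pandas as pd
from collections import Counter
from itertools import chain

df = pd.DataFrame({'InLevel_03': [12, 12, 13, 12, 11],
                   'InLevel_02': [11.5, 11.5, 12.5, 11.5, 10.5],
                   'InLevel_01': [11, 10.5, 12, 10.5, 9],
                   'OutLevel_01': [10.5, 10, 11.5, 10, 8.5],
                   'OutLevel_02': [10, 9.5, 11, 9.5, 8],
                   'OutLevel_03': [9.5, 9, 10, 9, 7.5]})

t = 0.5
# gap positions: where the drop between adjacent columns exceeds t
rows, cols = np.where(df - df.shift(-1, axis=1) > t)
# enumerate the missing levels inside each gap
v = [np.arange(*df.iloc[r, [c + 1, c]].values, step=t)[1:]
     for r, c in zip(rows, cols)]
x = list(chain.from_iterable(v))
result = pd.Series(Counter(x), name='count')
print(result)
```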

1 Comment

Thank you for your help! It works fine and I've learned some new ways to manipulate data thanks to you. However, the [np.arange(*df.iloc[r,[c+1, c]].values, step=t)[1:] for r, c in zip(rows, cols)] step takes minutes with my (610178, 10) shaped dataframe. Could there be a more efficient way to do this?
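Not part of the original answer, but one way the per-gap arange loop might be avoided (a sketch, not benchmarked at that scale): enumerate all missing levels at once with np.repeat and a flat-index trick, then count them with value_counts. The function name gap_counts is made up for this illustration:

```python
import numpy as np
import pandas as pd

def gap_counts(df, t=0.5):
    # hypothetical vectorized helper, not from the original answer
    a = df.to_numpy(dtype=float)
    diff = a[:, :-1] - a[:, 1:]                        # drops between adjacent columns
    rows, cols = np.where(diff > t)                    # gap positions
    lo = a[rows, cols + 1]                             # lower value of each gap
    k = np.rint(diff[rows, cols] / t).astype(int) - 1  # missing levels per gap
    # flat enumeration 1..k[i] for every gap, without a Python-level loop
    idx = np.arange(k.sum()) - np.repeat(np.cumsum(k) - k, k) + 1
    levels = np.repeat(lo, k) + t * idx
    return pd.Series(levels).value_counts().rename('count')
```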
