
I have a dataframe that looks like this:

>>> df = pd.DataFrame( {'InLevel_03': [12, 12, 13, 12, 11,], 'InLevel_02': [11.5, 11.5, 12.5, 11.5, 10.5], 'InLevel_01': [11, 10.5, 12, 10.5, 9], 'OutLevel_01': [10.5, 10, 11.5, 10, 8.5], 'OutLevel_02': [10, 9.5, 11, 9.5, 8], 'OutLevel_03': [9.5, 9, 10, 9, 7.5]} )

>>> df
   InLevel_03  InLevel_02  InLevel_01  OutLevel_01  OutLevel_02  OutLevel_03
0          12        11.5        11.0         10.5         10.0          9.5
1          12        11.5        10.5         10.0          9.5          9.0
2          13        12.5        12.0         11.5         11.0         10.0
3          12        11.5        10.5         10.0          9.5          9.0
4          11        10.5         9.0          8.5          8.0          7.5

Given a threshold value of 0.5, I want to check whether any row contains a gap bigger than that value between adjacent columns. For example, in the 2nd row there's a gap between InLevel_02 (11.5) and InLevel_01 (10.5): the missing level is 11. In the 5th row, the missing levels are 10 and 9.5, between InLevel_02 (10.5) and InLevel_01 (9.0).

The result of this job would look like this:

 gapLevel    count    # row number, column name of each gap
       11        2    # (1, InLevel_02 - 1, InLevel_01), (3, InLevel_02 - 3, InLevel_01)
     10.5        1    # (2, OutLevel_02 - 2, OutLevel_03)
       10        1    # (4, InLevel_02 - 4, InLevel_01)
      9.5        1    # (4, InLevel_02 - 4, InLevel_01)

I tried converting the dataframe into an array (using .to_records) and comparing each value with the next one in a loop, but the code gets too complicated when more than one level is missing between two values, and I'd like to know whether there's a more efficient way to do this.
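To make the gap definition concrete, here is a minimal sketch (using the 5th-row values from the example above) of how the missing levels between two adjacent cells can be enumerated:

```python
# Row 4 (0-indexed): InLevel_02 = 10.5, InLevel_01 = 9.0.
# The drop of 1.5 exceeds the step of 0.5, so two levels are missing.
a, b, step = 10.5, 9.0, 0.5

# number of levels missing strictly between b and a
n_missing = int(round((a - b) / step)) - 1

missing = [round(b + step * i, 1) for i in range(1, n_missing + 1)]
print(missing)  # [9.5, 10.0]
```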

  • Understood what needs to be done... but can you explain more about your output df? How is gapLevel matched with the values in the input df? Commented Jan 7, 2019 at 10:17
  • Also, do you want only these two columns in your output df, i.e. gapLevel and count? You don't want to merge these columns into the original `df`? Commented Jan 7, 2019 at 10:18
  • @Rahul Agarwal Thank you for the reply. I added how the result was made. I just counted the missing gaps per level in the original dataframe. I think the result should go into a new dataframe due to its different shape. Commented Jan 7, 2019 at 10:26
  • So the gaps are between values within a row; are the gaps between any pair of values within that row, or only pairs which sit next to each other? Commented Jan 7, 2019 at 10:28
  • @cardamom I'm sorry for the lack of clarity in my question. I mean pairs which sit next to each other. You can see that the levels in the first row decrease by 0.5 (12, 11.5, 11.0, 10.5, 10.0, 9.5). If the step is not 0.5 but 1 (1 gap exists) or 1.5 (2 gaps exist), it's not continuous and there's a gap. Commented Jan 7, 2019 at 10:35

1 Answer

Here's one approach:

You can begin by obtaining the row and column indices from which to extract the counts, by checking where the dataframe minus a shifted version of itself (see pd.shift) is greater than 0.5:

import numpy as np

t = 0.5
# df = df.astype(float)  # if it isn't already
rows, cols = np.where(df - df.shift(-1, axis=1) > t)
# (array([1, 2, 3, 4]), array([1, 4, 1, 1]))

Then build the range of missing values for each of these row/column pairs using a list comprehension (note that this approach assumes the values keep decreasing across the columns):

v = [np.arange(*df.iloc[r,[c+1, c]].values, step=t)[1:] for r, c in zip(rows, cols)]
# [array([11.]), array([10.5]), array([11.]), array([ 9.5, 10. ])]

Create a new Series from the counts using Counter:

from itertools import chain
from collections import Counter

x = list(chain.from_iterable(v))  # v is a plain list, so no .values here
# [11.0, 10.5, 11.0, 9.5, 10.0]
pd.Series(Counter(x), name = 'count')

11.0    2
10.5    1
9.5     1
10.0    1
Name: count, dtype: int64
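For reference, the steps above combined into a single runnable script (rebuilding the question's dataframe):

```python
import numpy as np
import pandas as pd
from collections import Counter
from itertools import chain

df = pd.DataFrame({'InLevel_03': [12, 12, 13, 12, 11],
                   'InLevel_02': [11.5, 11.5, 12.5, 11.5, 10.5],
                   'InLevel_01': [11, 10.5, 12, 10.5, 9],
                   'OutLevel_01': [10.5, 10, 11.5, 10, 8.5],
                   'OutLevel_02': [10, 9.5, 11, 9.5, 8],
                   'OutLevel_03': [9.5, 9, 10, 9, 7.5]})

t = 0.5
# gap positions: where the drop between adjacent columns exceeds t
rows, cols = np.where(df - df.shift(-1, axis=1) > t)
# enumerate the missing levels inside each gap
v = [np.arange(*df.iloc[r, [c + 1, c]].values, step=t)[1:]
     for r, c in zip(rows, cols)]
x = list(chain.from_iterable(v))
result = pd.Series(Counter(x), name='count')
print(result)
```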

1 Comment

Thank you for your help! It works fine and I've learned some new ways to manipulate data thanks to you. However, the [np.arange(*df.iloc[r,[c+1, c]].values, step=t)[1:] for r, c in zip(rows, cols)] step takes minutes with my (610178, 10) shaped dataframe. Could there be a more efficient way to do this?
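Not part of the original answer, but one way the per-gap arange loop might be avoided (a sketch, not benchmarked at that scale): enumerate all missing levels at once with np.repeat and a flat-index trick, then count them with value_counts. The function name gap_counts is made up for this illustration:

```python
import numpy as np
import pandas as pd

def gap_counts(df, t=0.5):
    # hypothetical vectorized helper, not from the original answer
    a = df.to_numpy(dtype=float)
    diff = a[:, :-1] - a[:, 1:]                        # drops between adjacent columns
    rows, cols = np.where(diff > t)                    # gap positions
    lo = a[rows, cols + 1]                             # lower value of each gap
    k = np.rint(diff[rows, cols] / t).astype(int) - 1  # missing levels per gap
    # flat enumeration 1..k[i] for every gap, without a Python-level loop
    idx = np.arange(k.sum()) - np.repeat(np.cumsum(k) - k, k) + 1
    levels = np.repeat(lo, k) + t * idx
    return pd.Series(levels).value_counts().rename('count')
```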
