
I want to select columns from a DataFrame according to a particular condition. I know it can be done with a loop, but my df is very large, so efficiency is crucial. The condition for column selection is having either only non-NaN entries, or a sequence of only NaNs followed by a sequence of only non-NaN entries.

Here is an example. Consider the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan], [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1], [3, 2, np.nan, 2], [3, 2, 5, np.nan]])
df

   0    1    2    3
0  1  NaN  2.0  NaN
1  2  NaN  5.0  NaN
2  4  8.0  NaN  1.0
3  3  2.0  NaN  2.0
4  3  2.0  5.0  NaN

From it, I would like to select only columns 0 and 1. Any advice on how to do this efficiently without looping?

2 Answers


Logic

  • Count the nulls in each column. If the only nulls are at the beginning, then the number of nulls in the column should equal the position of the first valid index.
  • Get the first valid index of each column.
  • Slice the index by the null counts and compare against the first valid indices. Where they are equal, that's a good column.

cnull = df.isnull().sum()                      # number of NaNs in each column
fvald = df.apply(pd.Series.first_valid_index)  # first non-NaN index label per column
cols = df.index[cnull] == fvald                # equal only when all NaNs sit at the top
df.loc[:, cols]



Edited with speed improvements

old answer

def pir1(df):
    cnull = df.isnull().sum()
    fvald = df.apply(pd.Series.first_valid_index)
    cols = df.index[cnull] == fvald
    return df.loc[:, cols]

Much faster answer using the same logic:

def pir2(df):
    nulls = np.isnan(df.values)              # boolean array, True where NaN
    null_count = nulls.sum(0)                # number of NaNs per column
    first_valid = nulls.argmin(0)            # position of the first non-NaN per column
    null_on_top = null_count == first_valid  # True when all NaNs sit at the top
    filtered_data = df.values[:, null_on_top]
    filtered_columns = df.columns.values[null_on_top]
    return pd.DataFrame(filtered_data, df.index, filtered_columns)
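
As a quick check on the example DataFrame from the question, both functions select the same columns; pir2 gets its speed from working on the underlying NumPy array instead of a per-column apply:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan], [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1], [3, 2, np.nan, 2], [3, 2, 5, np.nan]])

print(pir1(df).columns.tolist())   # [0, 1]
print(pir2(df).columns.tolist())   # [0, 1]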



4 Comments

Thanks @piRSquared. This solution indeed gets the job done, but it takes more than 3 times longer to run than the solution posted below.
@splinter I'm not surprised. I thought of going the route Nickil took, but I opted for brevity. Nickil provided a good answer. I'll update my post though using the same logic, but utilizing a few tricks to speed it up.
Sounds great @piRSquared
You are right, it is much faster. In my case it is almost 4 times faster than the solution proposed by Nickil Maveli.

Consider a DataFrame which has NaNs in various possible locations; a hypothetical example is constructed in the sketch after the code below.


1. NaNs present on both sides:

Create a mask by replacing all NaNs with 0s and finite values with 1s:

mask = np.where(np.isnan(df), 0, 1)

Take the element-wise difference down each column, then take the absolute value of the result. The logic is that whenever a column's differences contain all three values (-1, 1 and 0), there is a break in the sequence of NaNs, and that column should be discarded.

The idea is then to take the column-wise sum of the absolute differences and keep the columns where that sum is less than 2. A column whose NaNs form a single contiguous block at one end produces at most one transition (sum 0 or 1), whereas a broken column produces at least two transitions, so its sum reaches 2 and it is discarded.

criteria = pd.DataFrame(mask, columns=df.columns).diff(1).abs().sum().lt(2)

Finally, use this boolean condition to select the matching columns, giving a result where each kept column has NaNs only in one contiguous portion and finite values in the other.

df.loc[:, criteria]
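
To make the diff/abs/sum step concrete, here is a small walkthrough on a hypothetical DataFrame (the column names a-d and the values are made up purely for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],         # no NaNs                 -> kept
                   'b': [np.nan, np.nan, 3.0, 4.0],   # NaNs on top only        -> kept
                   'c': [1.0, np.nan, np.nan, 4.0],   # NaNs in the middle      -> dropped
                   'd': [1.0, 2.0, np.nan, np.nan]})  # NaNs at the bottom only -> kept

mask = np.where(np.isnan(df), 0, 1)
diffs = pd.DataFrame(mask, columns=df.columns).diff(1).abs()
print(diffs.sum())                            # a: 0, b: 1, c: 2, d: 1
criteria = diffs.sum().lt(2)
print(df.loc[:, criteria].columns.tolist())   # ['a', 'b', 'd']

Note that this criterion also keeps column d, whose NaNs form a single block at the bottom; the stricter "NaNs only on top" requirement from the question is what case 2 below enforces.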


2. NaNs present on top:

mask = np.where(np.isnan(df), 0, 1)
criteria = pd.DataFrame(mask, columns=df.columns).diff(1).ne(-1).all()
df.loc[:, criteria]
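
Applied to the example DataFrame from the question, this criterion keeps only columns 0 and 1:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan], [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1], [3, 2, np.nan, 2], [3, 2, 5, np.nan]])

mask = np.where(np.isnan(df), 0, 1)
criteria = pd.DataFrame(mask, columns=df.columns).diff(1).ne(-1).all()
print(df.loc[:, criteria].columns.tolist())   # [0, 1]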


1 Comment

Works great @NickilMaveli, and it does so 3 times faster than the solution above.
