
I want to select columns from a DataFrame according to a particular condition. I know it can be done with a loop, but my df is very large, so efficiency is crucial. The condition for column selection is having either only non-NaN entries, or a sequence of only NaNs followed by a sequence of only non-NaN entries.

Here is an example. Consider the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan], [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1], [3, 2, np.nan, 2], [3, 2, 5, np.nan]])
df

   0    1    2    3
0  1  NaN  2.0  NaN
1  2  NaN  5.0  NaN
2  4  8.0  NaN  1.0
3  3  2.0  NaN  2.0
4  3  2.0  5.0  NaN

From it, I would like to select only columns 0 and 1. Any advice on how to do this efficiently without looping?

2 Answers


Logic

  • Count the nulls in each column. If the only nulls are at the beginning, then the number of nulls in the column should equal the position of the first valid index.
  • Get the first valid index of each column.
  • Slice the index by the null counts and compare against the first valid indices. Where they are equal, that's a good column.

cnull = df.isnull().sum()                      # number of NaNs in each column
fvald = df.apply(pd.Series.first_valid_index)  # first non-NaN index label per column
cols = df.index[cnull] == fvald                # equal only when all NaNs sit at the top
df.loc[:, cols]



Edited with speed improvements

old answer

def pir1(df):
    cnull = df.isnull().sum()
    fvald = df.apply(pd.Series.first_valid_index)
    cols = df.index[cnull] == fvald
    return df.loc[:, cols]

Much faster answer using the same logic:

def pir2(df):
    nulls = np.isnan(df.values)              # boolean array, True where NaN
    null_count = nulls.sum(0)                # number of NaNs per column
    first_valid = nulls.argmin(0)            # position of the first non-NaN per column
    null_on_top = null_count == first_valid  # True when all NaNs sit at the top
    filtered_data = df.values[:, null_on_top]
    filtered_columns = df.columns.values[null_on_top]
    return pd.DataFrame(filtered_data, df.index, filtered_columns)
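
As a quick check on the example DataFrame from the question, both functions select the same columns; pir2 gets its speed from working on the underlying NumPy array instead of a per-column apply:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan], [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1], [3, 2, np.nan, 2], [3, 2, 5, np.nan]])

print(pir1(df).columns.tolist())   # [0, 1]
print(pir2(df).columns.tolist())   # [0, 1]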



4 Comments

Thanks @piRSquared. This solution indeed gets the job done, but it takes more than 3 times longer to run than the solution posted below.
@splinter I'm not surprised. I thought of going the route Nickil took, but I opted for brevity. Nickil provided a good answer. I'll update my post though using the same logic, but utilizing a few tricks to speed it up.
Sounds great @piRSquared
You are right, it is much faster. In my case it is almost 4 times faster than the solution proposed by Nickil Maveli.

Consider a DataFrame which has NaNs in various possible locations; a hypothetical example is constructed in the sketch after the code below.


1. NaNs present on both sides:

Create a mask by replacing all NaNs with 0s and finite values with 1s:

mask = np.where(np.isnan(df), 0, 1)

Take the element-wise difference down each column, then take the absolute value of the result. The logic is that whenever a column's differences contain all three values (-1, 1 and 0), there is a break in the sequence of NaNs, and that column should be discarded.

The idea is then to take the column-wise sum of the absolute differences and keep the columns where that sum is less than 2. A column whose NaNs form a single contiguous block at one end produces at most one transition (sum 0 or 1), whereas a broken column produces at least two transitions, so its sum reaches 2 and it is discarded.

criteria = pd.DataFrame(mask, columns=df.columns).diff(1).abs().sum().lt(2)

Finally, use this boolean condition to select the matching columns, giving a result where each kept column has NaNs only in one contiguous portion and finite values in the other.

df.loc[:, criteria]
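
To make the diff/abs/sum step concrete, here is a small walkthrough on a hypothetical DataFrame (the column names a-d and the values are made up purely for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],         # no NaNs                 -> kept
                   'b': [np.nan, np.nan, 3.0, 4.0],   # NaNs on top only        -> kept
                   'c': [1.0, np.nan, np.nan, 4.0],   # NaNs in the middle      -> dropped
                   'd': [1.0, 2.0, np.nan, np.nan]})  # NaNs at the bottom only -> kept

mask = np.where(np.isnan(df), 0, 1)
diffs = pd.DataFrame(mask, columns=df.columns).diff(1).abs()
print(diffs.sum())                            # a: 0, b: 1, c: 2, d: 1
criteria = diffs.sum().lt(2)
print(df.loc[:, criteria].columns.tolist())   # ['a', 'b', 'd']

Note that this criterion also keeps column d, whose NaNs form a single block at the bottom; the stricter "NaNs only on top" requirement from the question is what case 2 below enforces.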


2. NaNs present on top:

mask = np.where(np.isnan(df), 0, 1)
criteria = pd.DataFrame(mask, columns=df.columns).diff(1).ne(-1).all()
df.loc[:, criteria]
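
Applied to the example DataFrame from the question, this criterion keeps only columns 0 and 1:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan], [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1], [3, 2, np.nan, 2], [3, 2, 5, np.nan]])

mask = np.where(np.isnan(df), 0, 1)
criteria = pd.DataFrame(mask, columns=df.columns).diff(1).ne(-1).all()
print(df.loc[:, criteria].columns.tolist())   # [0, 1]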


1 Comment

Works great @NickilMaveli, and it does so 3 times faster than the solution above.
