Pandas Select DataFrame columns using boolean

Question

I want to use a boolean to select the columns with more than 4000 entries from a dataframe comb which has over 1,000 columns. This expression gives me a Boolean (True/False) result:

criteria = comb.ix[:,'c_0327':].count()>4000

I want to use it to select only the True columns into a new DataFrame.
The following just gives me "Unalignable boolean Series key provided":

comb.loc[criteria,]

I also tried:

comb.ix[:, comb.ix[:,'c_0327':].count()>4000]

This is similar to the answer to the question "dataframe boolean selection along columns instead of row", but that gives me the same error: "Unalignable boolean Series key provided"

comb.ix[:,'c_0327':].count()>4000

yields:

c_0327    False
c_0328    False
c_0329    False
c_0330    False
c_0331    False
c_0332    False
c_0333    False
c_0334    False
c_0335    False
c_0336    False
c_0337     True
c_0338    False
.....

comb[criteria.columns] gives me "'Series' object has no attribute 'columns'"
– dartdog, Commented Mar 26, 2015 at 15:24

dartdog · Accepted Answer · 2015-03-26 16:33:20Z

50

What is returned is a Series with the column names as the index and the boolean values as the values.

I think what you actually want is the following, which should now work:

comb[criteria.index[criteria]]

Basically, this masks the index of criteria with its own boolean values, which returns an Index of the column names where criteria is True. We can then use that to select the columns of interest from the original df.
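
As a minimal sketch of that masking step, assuming a small made-up frame in place of comb (the data and the lowered threshold of 2 are illustrative and not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `comb`: three columns with different numbers of non-null entries.
comb = pd.DataFrame(
    {
        "c_0327": [1.0, np.nan, np.nan, np.nan],
        "c_0328": [1.0, 2.0, 3.0, np.nan],
        "c_0329": [1.0, 2.0, np.nan, np.nan],
    }
)

# Boolean Series indexed by column name (threshold lowered from 4000 to 2 for the toy data).
criteria = comb.count() > 2

# Mask the index with its own boolean values to get the names of the True columns.
selected_names = criteria.index[criteria]

# Use those names for ordinary column selection.
result = comb[selected_names]
print(list(result.columns))
```

Because the selection goes through column names rather than a boolean mask, criteria does not need to cover every column of comb for this to work.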

edited Mar 26, 2015 at 16:33

dartdog

answered Mar 26, 2015 at 15:26

EdChum

4 Comments

Areza Over a year ago

I am surprised to see there is no shorter (more straightforward) way of doing this.

johnDanger Over a year ago

There is; this answer is 5 years old and outdated. See my answer below for the straightforward way.

c z Over a year ago

@johnDanger Nice answer, but I'm not sure I'd agree that going from "m[f] for row filtering" to "m.loc[:,f] for column filtering" is straightforward.

chickenNinja123 Over a year ago

The "straightforwardness" in @johnDanger's answer is that you only need criteria once, and hence you do not need to define the variable separately (but can just use the expression itself in m.loc[:, expression_of_criteria_itself]).

johnDanger · Accepted Answer · 2020-12-11 16:53:52Z

38

In pandas 0.25:

comb.loc[:, criteria]

Returns a DataFrame with columns selected by the Boolean list or Series.

For multiple criteria:

comb.loc[:, criteria1 & criteria2]

And for selecting rows with an index criteria:

comb[criteria]

Note: The bit-wise operator & is required (not and). See Logical operators for boolean indexing in Pandas.

Other note: If the criteria are expressions (e.g., comb.columnX > 3) and multiple criteria are used, remember to enclose each expression in parentheses! This is because & and | have higher precedence than >, ==, etc. (whereas and and or have lower precedence).
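
A rough sketch of the points above, using a made-up frame in place of comb (the column names, data, and thresholds are assumptions for illustration). It also covers one case the answer does not spell out: a criteria built from only a slice of the columns, as in the question, has a shorter index than comb.columns, which may be what triggers the "Unalignable boolean Series" error. Reindexing it against all columns is one possible workaround.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `comb`; names, data, and thresholds are illustrative.
comb = pd.DataFrame(
    {
        "id": [10, 11, 12, 13],
        "c_0327": [1.0, np.nan, np.nan, np.nan],
        "c_0328": [1.0, 2.0, 3.0, np.nan],
        "c_0329": [5.0, 6.0, 7.0, 8.0],
    }
)

# Two column-wise conditions combined with &; each comparison gets its own parentheses.
criteria = (comb.count() > 2) & (comb.max() < 10)
by_columns = comb.loc[:, criteria]

# A criteria built from a slice of the columns only covers those columns.
partial = comb.loc[:, "c_0327":].count() > 2

# Reindex against every column, treating the missing ones as False, so it can align.
aligned = partial.reindex(comb.columns, fill_value=False)
from_partial = comb.loc[:, aligned]

print(list(by_columns.columns))
print(list(from_partial.columns))
```

The reindex step is only needed when the boolean Series was computed on a subset of the columns; a criteria computed on the whole frame already shares its index with comb.columns.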

edited Dec 11, 2020 at 16:53

answered Aug 20, 2019 at 16:31

johnDanger

jberrio · Accepted Answer · 2018-10-27 05:33:50Z

7

You can also use:

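# To filter columns (assuming criteria length is equal to the number of columns of comb)
comb.ix[:, criteria]  # .ix was deprecated in pandas 0.20 and removed in 1.0; use comb.loc[:, criteria] there
comb.iloc[:, criteria.values]  # iloc needs a plain boolean array, not a Series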

# To filter rows (assuming criteria length is equal to the number of rows of comb)
comb[criteria]

edited Oct 27, 2018 at 5:33

jberrio

answered Jan 24, 2017 at 12:14

Yohan Obadia

2 Comments

Mischa Lisovyi Over a year ago

The first answer looks the most elegant for masked column selection. The only trick is that one needs to do comb.iloc[:, criteria.values], as a series is not a valid argument into iloc slicing of this type

Yohan Obadia Over a year ago

I should have specified that I expected criteria to be a boolean list. Good catch.

Giorgos Myrianthous · Accepted Answer · 2021-06-25 14:04:40Z

3

You can pass a boolean array to loc to indicate which columns should be kept and which not.

For example,

>>> df
    A   B   C   D    E
0  73  15  55  33  foo
1  63  64  11  11  bar
2  56  72  57  55  foo

>>> df.loc[:, [True, True, False, False, True]]
    A   B    E
0  73  15  foo
1  63  64  bar
2  56  72  foo

answered Jun 25, 2021 at 14:04

Giorgos Myrianthous

Krishna · Accepted Answer · 2018-09-21 04:05:25Z

1

I'm using this; it's cleaner:

comb.values[:,criteria]

credit: https://stackoverflow.com/a/43291257/815677

answered Sep 21, 2018 at 4:05

Krishna

1 Comment

Keith Hughitt Over a year ago

Just to be clear, this returns a numpy.ndarray, and not a pandas.DataFrame.
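
To make that difference concrete, here is a small sketch on a made-up frame (the data and the threshold of 2 are illustrative assumptions) comparing the array-based selection with the .loc form:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `comb`; names and the threshold of 2 are illustrative.
comb = pd.DataFrame(
    {
        "c_0327": [1.0, np.nan, np.nan],
        "c_0328": [1.0, 2.0, 3.0],
    }
)
criteria = comb.count() > 2

# Indexing the underlying NumPy array drops the row index and the column labels.
as_array = comb.values[:, criteria.to_numpy()]

# Indexing through .loc keeps them.
as_frame = comb.loc[:, criteria]

print(type(as_array).__name__, as_array.shape)
print(type(as_frame).__name__, list(as_frame.columns))
```

The mask is converted with to_numpy() here only to make the array indexing explicit.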

Seth Johnson · Accepted Answer · 2020-03-30 23:19:09Z

-1

Another solution is to transpose comb so that its columns act as its index, then transpose the resulting subset back:

comb.T[criteria].T

Again, not particularly elegant, but at least shorter/less repetitive than the leading solution.
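
One side effect worth checking before relying on this: when the frame mixes dtypes, the round trip through a transpose can leave the selected columns with object dtype. A minimal sketch on a made-up mixed-dtype frame (the names and data are illustrative assumptions):

```python
import pandas as pd

# Hypothetical mixed-dtype frame; column names and values are illustrative.
comb = pd.DataFrame({"A": [73, 63, 56], "B": [15, 64, 72], "E": ["foo", "bar", "foo"]})
criteria = pd.Series([True, True, False], index=comb.columns)

via_transpose = comb.T[criteria].T
via_loc = comb.loc[:, criteria]

# Both routes select the same column labels...
same_columns = list(via_transpose.columns) == list(via_loc.columns)

# ...but the transposed route comes back with object columns instead of integers.
print(via_transpose.dtypes.tolist())
print(via_loc.dtypes.tolist())
```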

answered Mar 30, 2020 at 23:19

Seth Johnson

4 Comments

Jean Paul Over a year ago

There are already proposed solutions which are shorter/less repetitive than the accepted solution but also more elegant.

william_grisaitis Over a year ago

Seconding @JeanPaul... best to avoid transposes

Seth Johnson Over a year ago

@william_grisaitis What's the problem with transposes? Are they memory/compute intensive, or do you just find the T aesthetically displeasing, or ...?

william_grisaitis Over a year ago

@SethJohnson they can be really slow. not an expert, but that's my experience. if i had to guess, it's reallocating memory under the hood for everything (not a zero-copy operation).

william_grisaitis · Accepted Answer · 2022-03-15 20:16:04Z

-1

Another approach is to use Python's built-in filter function:

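def satisfies_criteria(column):
    # True when the column has more than 4000 non-null entries
    return comb[column].count() > 4000


# filter() returns a lazy iterator in Python 3, so materialize it as a list
cols = list(filter(satisfies_criteria, comb.columns))
comb[cols]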

answered Mar 15, 2022 at 20:16

william_grisaitis
