Pandas dataframe sub-selection

Question

I am new to programming and have taken up learning Python in an attempt to make some tasks I run in my research more efficient. I am running a PCA using pandas (I found a tutorial online) and have the script for this, but I need to sub-select part of a dataframe prior to the PCA.

So far I have the following (just an example; in reality I am reading a .csv file with a larger matrix):

import numpy as np
import pandas as pd
x = np.random.randint(30, size=(8, 8))
df = pd.DataFrame(x)

    0   1   2   3   4   5   6   7
0   9   0  23  13   2   5  14   6
1  20  17  11  10  25  23  20  23
2  15  14  22  25  11  15   5  15
3   9  27  15  27   7  15  17  23
4  12   6  11  13  27  11  26  20
5  27  13   5  16   5   5   2  18
6   3  18  22   0   7  10  11  11
7  25  18  10  11  29  29   1  25

What I want to do is sub-select the columns that satisfy a certain criterion in any of their rows. Specifically, I want every column that has at least one number >= 27 (just for example), to produce a new dataframe:

    0   1   3   4   5   
0   9   0  13   2   5  
1  20  17  10  25  23  
2  15  14  25  11  15   
3   9  27  27   7  15  
4  12   6  13  27  11  
5  27  13  16   5   5  
6   3  18   0   7  10  
7  25  18  11  29  29

I have looked into the various slicing methods in pandas but none seem to do what I want (.loc and .iloc etc.).

The actual script I am using to read in thus far is

filename = 'Data.csv'
data = pd.read_csv(filename, sep=',')
x = data.iloc[:, 1:]  # variables - species
y = data.iloc[:, 0]  # cases - age

So a sub-dataframe of x is what I am after (as above).

Any advice is greatly appreciated.

score 1 · Accepted Answer · 2016-05-30 21:52:53Z

Indexers like loc and iloc accept boolean arrays (so did ix, which was deprecated and then removed in pandas 1.0). For example, if you have three columns, df.loc[:, [True, False, True]] returns all the rows and columns 0 and 2 (the positions where the corresponding value is True). You can check whether any element in a column is greater than or equal to 27 with (df>=27).any(). This returns a boolean Series that is True for the columns that have at least one value >=27. So you can slice the dataframe with:

df.loc[:, (df>=27).any()]
Out[34]: 
    0   1   3   4   5   7
0   8   2  28   9  14  21
1  24  26  23  17   0   0
2   3  24   7  15   4  28
3  29  17  12   7   7   6
4   5   3  10  24  29  14
5  23  21   0  16  23  13
6  22  10  27   1   7  24
7   9  27   2  27  17  12

And this is the initial dataframe:

df
Out[35]: 
    0   1   2   3   4   5   6   7
0   8   2   7  28   9  14  26  21
1  24  26  15  23  17   0  21   0
2   3  24  26   7  15   4   7  28
3  29  17   9  12   7   7   0   6
4   5   3  13  10  24  29  22  14
5  23  21  26   0  16  23  17  13
6  22  10  19  27   1   7   9  24
7   9  27  26   2  27  17   8  12
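
To make the steps explicit, here is a minimal sketch that applies the same idea to the 8x8 example from the question, typed in by hand so the result is reproducible. The names threshold, hits, keep and subset are only illustrative and do not appear in the original post.

```python
import pandas as pd

# The 8x8 example dataframe from the question, entered by hand
# (the original was generated with np.random.randint).
df = pd.DataFrame([
    [9, 0, 23, 13, 2, 5, 14, 6],
    [20, 17, 11, 10, 25, 23, 20, 23],
    [15, 14, 22, 25, 11, 15, 5, 15],
    [9, 27, 15, 27, 7, 15, 17, 23],
    [12, 6, 11, 13, 27, 11, 26, 20],
    [27, 13, 5, 16, 5, 5, 2, 18],
    [3, 18, 22, 0, 7, 10, 11, 11],
    [25, 18, 10, 11, 29, 29, 1, 25],
])

threshold = 27  # example cutoff from the question (illustrative name)

# Element-wise comparison: a boolean dataframe with the same shape as df.
hits = df >= threshold

# any() reduces down each column by default: one boolean per column label.
keep = hits.any()

# A boolean Series indexed by column label selects the columns to keep.
subset = df.loc[:, keep]

print(list(subset.columns))  # [0, 1, 3, 4, 5]
```

For the CSV case in the question, the same pattern would presumably be x.loc[:, (x >= 27).any()], assuming every column of x is numeric.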
