I am new to programming and have taken up learning Python to make some tasks in my research more efficient. I am running a PCA with pandas (following a tutorial I found online) and have the script for this, but I need to select a subset of the dataframe's columns before the PCA.

So far I have the following (just an example; in reality I am reading a .csv file with a larger matrix):

import numpy as np
import pandas as pd

x = np.random.randint(30, size=(8,8))
df = pd.DataFrame(x)

    0   1   2   3   4   5   6   7
0   9   0  23  13   2   5  14   6
1  20  17  11  10  25  23  20  23
2  15  14  22  25  11  15   5  15
3   9  27  15  27   7  15  17  23
4  12   6  11  13  27  11  26  20
5  27  13   5  16   5   5   2  18
6   3  18  22   0   7  10  11  11
7  25  18  10  11  29  29   1  25

What I want to do is select the columns that satisfy a certain criterion in any of their rows. Specifically, I want every column that has at least one number >= 27 (just for example), to produce a new dataframe:

    0   1   3   4   5   
0   9   0  13   2   5  
1  20  17  10  25  23  
2  15  14  25  11  15   
3   9  27  27   7  15  
4  12   6  13  27  11  
5  27  13  16   5   5  
6   3  18   0   7  10  
7  25  18  11  29  29  

I have looked into the various slicing methods in pandas but none seem to do what I want (.loc and .iloc etc.).

The actual script I am using to read in the data so far is:

filename = 'Data.csv' 
data = pd.read_csv(filename,sep = ',')
x = data.iloc[:,1:] # variables - species (.ix has been removed from pandas; .iloc is the positional equivalent)
y = data.iloc[:,0] # cases - age

So a sub-dataframe of x is what I am after (as above).

Any advice is greatly appreciated.

1 Answer

Indexers like loc and iloc accept boolean arrays (as did ix, which has since been removed from pandas). For example, if you have three columns, df.loc[:, [True, False, True]] returns all the rows and columns 0 and 2 (the positions where the value is True). You can check whether any element in a column is greater than or equal to 27 with (df>=27).any(). This returns True for each column that has at least one value >=27. So you can slice the dataframe with:

df.loc[:, (df>=27).any()]
Out[34]: 
    0   1   3   4   5   7
0   8   2  28   9  14  21
1  24  26  23  17   0   0
2   3  24   7  15   4  28
3  29  17  12   7   7   6
4   5   3  10  24  29  14
5  23  21   0  16  23  13
6  22  10  27   1   7  24
7   9  27   2  27  17  12

And this is the initial dataframe:

df
Out[35]: 
    0   1   2   3   4   5   6   7
0   8   2   7  28   9  14  26  21
1  24  26  15  23  17   0  21   0
2   3  24  26   7  15   4   7  28
3  29  17   9  12   7   7   0   6
4   5   3  13  10  24  29  22  14
5  23  21  26   0  16  23  17  13
6  22  10  19  27   1   7   9  24
7   9  27  26   2  27  17   8  12
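
To make the result reproducible, here is a minimal sketch that applies the same mask to the exact 8x8 matrix shown in the question instead of random data. The names values, threshold, mask, selected and selected_by_position are only illustrative, and the .iloc variant with to_numpy() is an assumption about how one might do the same thing positionally, not something taken from the question.

```python
import pandas as pd

# The 8x8 example matrix from the question, typed in by hand
# so the outcome does not depend on a random seed.
values = [
    [9, 0, 23, 13, 2, 5, 14, 6],
    [20, 17, 11, 10, 25, 23, 20, 23],
    [15, 14, 22, 25, 11, 15, 5, 15],
    [9, 27, 15, 27, 7, 15, 17, 23],
    [12, 6, 11, 13, 27, 11, 26, 20],
    [27, 13, 5, 16, 5, 5, 2, 18],
    [3, 18, 22, 0, 7, 10, 11, 11],
    [25, 18, 10, 11, 29, 29, 1, 25],
]
df = pd.DataFrame(values)

threshold = 27  # example cutoff used in the question

# Boolean Series indexed by column label:
# True where at least one row in that column meets the cutoff.
mask = (df >= threshold).any()

# .loc aligns the boolean Series with the column labels.
selected = df.loc[:, mask]

# Positional alternative: pass a plain boolean array to .iloc.
selected_by_position = df.iloc[:, mask.to_numpy()]

print(list(selected.columns))  # [0, 1, 3, 4, 5]
```

This keeps columns 0, 1, 3, 4 and 5, matching the dataframe the question asks for. For the CSV case, the same expression should apply to x in place of df, e.g. x.loc[:, (x >= 27).any()], assuming all columns of x are numeric.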