I'm a newbie to Pandas and I'm trying to apply it to a script that I have already written. I have a csv file from which I extract the data, and use the columns 'candidate', 'final track' and 'status' for my data frame.

My problem is that I would like to filter the data, perhaps using the method shown in Wes McKinney's 10-minute tutorial ('http://nbviewer.ipython.org/urls/gist.github.com/wesm/4757075/raw/a72d3450ad4924d0e74fb57c9f62d1d895ea4574/PandasTour.ipynb'). In cell In [80], he uses aapl_bars.close_price['2009-10-15'].

I would like to use a similar method to select all the rows that have * as their status. Rows without a * should be dropped entirely, including their values in the other columns.

My code at the moment:

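import pandas as pd

def establish_current_tacks(filename):
    # Read the whole csv, then keep columns 0, 10 and 11
    # (candidate, final track and status).
    df = pd.read_csv(filename)
    cols = [df.iloc[:, 0], df.iloc[:, 10], df.iloc[:, 11]]
    current_tracks = pd.concat(cols, axis=1)
    return current_tracks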

My DataFrame:

>>> current_tracks
<class 'pandas.core.frame.DataFrame'>
Int64Index: 707 entries, 0 to 706
Data columns (total 3 columns):
candidate       695  non-null values
 final track    670  non-null values
 status         670  non-null values
dtypes: float64(1), object(2)

I would like to use something such as current_tracks.status['*'], but this does not work.

Apologies if this is obvious; I'm struggling a little to get my head around it.

  • Since you are filtering your columns after reading the csv, it would be more efficient to filter in the call to read_csv, like so: df=pd.read_csv(filename, usecols=[0,10,11]). Alternatively, you can pass a list of the column names: df=pd.read_csv(filename, usecols=['candidate', 'final track', 'status']). It will load much quicker. Commented Sep 24, 2013 at 11:14
  • Cheers Ed. I've got two questions. 1. The two methods you explained do not produce the same result (the first gives the one I want). 2. How would you index these columns with the column names? Commented Sep 24, 2013 at 11:26
  • I assumed the column names were the same as in your output. It's possible they are not, which would explain 1. The second code snippet has the correct syntax and form, but for some reason it did not produce the same result, and you would need to understand why. You can read more about the parameters here Commented Sep 24, 2013 at 11:59
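The usecols suggestion from the comments can be sketched as follows. This is only an illustration: the CSV text, the 'other' column and the variable names are made up, and the header deliberately has a space after each comma, which is one plausible reason the by-name and by-position calls could behave differently.

```python
import io
import pandas as pd

# Hypothetical CSV standing in for the asker's file.
# Note the space after each comma in the header row only.
csv_text = (
    "candidate, other, final track, status\n"
    "Bob,x,10,*\n"
    "Jim,y,15,.\n"
    "Alice,z,13,*\n"
)

# Select columns by position while reading.
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2, 3])

# Select columns by name; the names must match the header exactly,
# including the leading spaces.
by_name = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['candidate', ' final track', ' status'],
)

print(by_position.columns.tolist())
```

Printing the column list shows the exact strings pandas parsed, so any leading spaces become visible.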

1 Answer

Since the data you want to filter based on is not part of the data frame's index, but instead is a regular column, you need to do something like this:

current_tracks[current_tracks.status == '*']

Full example:

import pandas as pd
current_tracks = pd.DataFrame({'candidate': ['Bob', 'Jim', 'Alice'],
                               'final_track': [10, 15, 13],
                               'status': ['*', '.', '*']})
current_tracks
Out[3]: 
  candidate  final_track status
0       Bob           10      *
1       Jim           15      .
2     Alice           13      *

current_tracks[current_tracks.status == '*']
Out[4]: 
  candidate  final_track status
0       Bob           10      *
2     Alice           13      *
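It may help to see the comparison and the selection as two separate steps. The sketch below uses made-up data with a missing status (the question's DataFrame has fewer non-null values than rows), and the names mask and selected are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data; the last row has a missing status, as some rows in the question do.
df = pd.DataFrame({
    'candidate': ['Bob', 'Jim', 'Alice', 'Eve'],
    'final_track': [10.0, 15.0, 13.0, np.nan],
    'status': ['*', '.', '*', np.nan],
})

# Step 1: the comparison produces a boolean Series, one value per row.
# A missing status compares as False, so those rows are not selected.
mask = df['status'] == '*'

# Step 2: indexing with the boolean Series keeps only the rows marked True.
selected = df.loc[mask]

print(mask)
print(selected)
```

Writing df['status'] with brackets instead of df.status also works for column names that contain spaces.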

If status was part of your dataframe's index, your original syntax would have worked:

current_tracks = current_tracks.set_index('status')
current_tracks.candidate['*']
Out[8]: 
status
*           Bob
*         Alice
Name: candidate, dtype: object
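With status as the index, label-based selection through .loc is another way to write the same lookup. This is a small sketch on the same toy data as above; the variable names are illustrative.

```python
import pandas as pd

current_tracks = pd.DataFrame({
    'candidate': ['Bob', 'Jim', 'Alice'],
    'final_track': [10, 15, 13],
    'status': ['*', '.', '*'],
})

# Move status into the index so its values become row labels.
indexed = current_tracks.set_index('status')

# All rows labelled '*'; the label is repeated, so the result is a DataFrame.
starred_rows = indexed.loc['*']

# Only the candidate column for those rows.
starred_candidates = indexed.loc['*', 'candidate']

print(starred_rows)
```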

2 Comments

Thanks @Marius. I'm trying to use the second part, with set_index. I get the error KeyError: u'no item named status'
Looking at your DataFrame, it looks like the 'status' column is actually named ' status', with a leading space. Maybe you parsed it funny? Try renaming the columns explicitly (current_tracks.columns = ['candidate', 'final_track', 'status']) and see if it works.
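A quick way to check for this is to print the column names as a list, which shows any stray whitespace. The sketch below assumes the names carry leading spaces, as the printed DataFrame in the question suggests, and strips them instead of retyping each name; the data is made up.

```python
import pandas as pd

# Made-up frame whose column names have leading spaces, as suspected above.
df = pd.DataFrame({
    'candidate': ['Bob', 'Jim'],
    ' final track': [10, 15],
    ' status': ['*', '.'],
})

# The list form makes the leading spaces visible.
print(df.columns.tolist())

# 'status' without the space is not a column yet.
has_status_before = 'status' in df.columns

# Strip surrounding whitespace from every column name.
df.columns = df.columns.str.strip()

# Now the filter from the answer works with the plain name.
filtered = df[df['status'] == '*']
print(filtered)
```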
