
I have been using the pandas library and crosstab to create a frequency DataFrame. In the following code I read in a CSV, create a DataFrame, and then create a crosstab, which is a frequency DataFrame. Then I take a cross-section of the data to pull out specific columns and the data beneath them.

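import pandas as pd
from pandas import DataFrame

def dataforgraphs():
    d = readcsv()  # helper defined elsewhere that reads the CSV
    df = DataFrame(d)
    d0 = df[0]  # used as Date
    d1 = df[1]  # used as Prov
    d2 = df[2]  # used as RigStat
    d3 = df[3]  # used as Obj
    d4 = df[4]

    # Frequency table: one row per Date, columns keyed by (RigStat, Prov, Obj)
    cta = pd.crosstab(d0, [d2, d1, d3], rownames=['Date'], colnames=['RigStat', 'Prov', 'Obj'], margins=False)

    # Cross-section for the province 'AB', then the three Obj labels
    ndfAB = cta.xs('AB', level='Prov', axis=1)
    ABrigs = ndfAB.xs(['BIT', 'GAS', 'OIL'], axis=1)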

From here, my problem is that I cannot pull a cross-section for the hypothetical column that would hold all the blank values, i.e. those not labelled 'BIT', 'GAS' or 'OIL'. In an Excel pivot table, I can do this by checking the (blank) box when selecting the columns to include. I want to do the same thing here to get a frequency count of all the blank entries.

Any suggestions?

Currently I get the following output, which has only the three specified columns and the frequencies below them.

            OIL   GAS   BIT
Date  
01-01-2007   1     6     3
01-02-2007   2     4     4
01-03-2007   1     6     3
01-04-2007   5     6     4
01-05-2007   1     7     3
01-06-2007   6     6     6
01-07-2007   1     8     3
01-08-2007   5     6     6
01-09-2007   1     6     3
01-10-2007   1     7     3

Instead, I would like to get the following, which includes a column counting all blank values, i.e. those not listed as OIL, GAS or BIT (or as anything else, for that matter).

            OIL   GAS   BIT  "blank"
Date  
01-01-2007   1     6     3     10
01-02-2007   2     4     4     11
01-03-2007   1     6     3     12
01-04-2007   5     6     4     10
01-05-2007   1     7     3      1
01-06-2007   6     6     6      4
01-07-2007   1     8     3      5
01-08-2007   5     6     6      2
01-09-2007   1     6     3      5
01-10-2007   1     7     3      2

The data going into pd.crosstab is structured as follows:

Date         Obj  Operator  Type  Address
01-01-2007   OIL   ABC      HZ    112 W Ave
01-01-2007   GAS   ABC      HZ    112 W Ave
01-01-2007   GAS   ABV      HZ    113 W Ave
01-01-2007   BIT   NCH      HZ    114 W Ave
01-01-2007         CNR      HZ    115 W Ave
01-02-2007   OIL   CNRL     HZ    112 W Ave
01-02-2007   OIL   CNRL     HZ    112 W Ave
01-02-2007   OIL   CNRL     HZ    112 W Ave
01-03-2007         CNRL     HZ    112 W Ave
01-03-2007         CNRL     HZ    112 W Ave

From here, pandas crosstab creates a frequency table that captures the frequency of OIL, GAS and BIT by date, but I can't find how to get the count of blank values. Notice that some rows don't have an Obj listed. These are the values that are not captured in the crosstab and that I would like to be able to query.

Any suggestions?

  • Can you provide a reproducible example with real data and show the expected output? Commented Jul 12, 2014 at 9:19
  • There, I made some edits to clarify. Commented Jul 14, 2014 at 16:40
  • It would be easier if you provided some example data that reproduces the problem (random data in the same structure is fine), as it is still not very clear to me. Commented Jul 14, 2014 at 17:00
  • Mostly the issue is in understanding exactly what pandas crosstab does with the raw data when putting it into a frequency table. I will include an example of the data so that you can see what pd.crosstab is initially working with. Commented Jul 14, 2014 at 17:04

2 Answers


One possibility is to fill the NaN values with the desired string (e.g. 'blank'), so that they are also counted:

In [23]: df
Out[23]: 
         Date  Obj Operator Type    Address
0  01-01-2007  OIL      ABC   HZ  112 W Ave
1  01-01-2007  GAS      ABC   HZ  112 W Ave
2  01-01-2007  GAS      ABV   HZ  113 W Ave
3  01-01-2007  BIT      NCH   HZ  114 W Ave
4  01-01-2007  NaN      CNR   HZ  115 W Ave
5  01-02-2007  OIL     CNRL   HZ  112 W Ave
6  01-02-2007  OIL     CNRL   HZ  112 W Ave
7  01-02-2007  OIL     CNRL   HZ  112 W Ave
8  01-03-2007  NaN     CNRL   HZ  112 W Ave
9  01-03-2007  NaN     CNRL   HZ  112 W Ave

In [24]: pd.crosstab(df['Date'], df['Obj'])
Out[24]: 
Obj         BIT  GAS  OIL
Date                     
01-01-2007    1    2    1
01-02-2007    0    0    3

In [25]: df2 = df.fillna('blank')

In [26]: pd.crosstab(df2['Date'], df2['Obj'])
Out[26]: 
Obj         BIT  GAS  OIL  blank
Date                            
01-01-2007    1    2    1      1
01-02-2007    0    0    3      0
01-03-2007    0    0    0      2

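What crosstab actually does is group by the row and column values you provide (which become the row and column indices) and count the frequency of each combination. Rows whose value is NaN do not form a group, which is why the blanks only show up once they have been filled.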
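To make the grouping description concrete, here is a minimal sketch that rebuilds the example data (Date and Obj columns only) and compares crosstab with an explicit groupby. Filling only the Obj column, rather than the whole frame, is my own variation on the approach above, and the names obj, ct and gb are just illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows mirroring the question's data (Date and Obj columns only).
df = pd.DataFrame({
    "Date": ["01-01-2007"] * 5 + ["01-02-2007"] * 3 + ["01-03-2007"] * 2,
    "Obj": ["OIL", "GAS", "GAS", "BIT", np.nan,
            "OIL", "OIL", "OIL", np.nan, np.nan],
})

# Fill only the Obj column so NaN values in other columns stay untouched.
obj = df["Obj"].fillna("blank")

# crosstab counts each (Date, Obj) pair.
ct = pd.crosstab(df["Date"], obj)

# The same counts via an explicit groupby: group on both keys,
# take the group sizes, then pivot Obj into columns.
gb = df.assign(Obj=obj).groupby(["Date", "Obj"]).size().unstack(fill_value=0)

print(ct)
print(gb)
```

Both tables should contain the same counts, including the 'blank' column. If the placeholder string could collide with a real label in your data, pick a different one.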


Reindex your confusion matrix and fill zeros in those positions.

df_confusion = pd.crosstab(y_actual, y_predicted).reindex(columns=[0,1],index=[0,1], fill_value=0)

Pass the expected row and column labels through the index and columns arguments of reindex, and set fill_value=0.
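As a small illustration of the reindex approach, the sketch below uses made-up labels (y_actual and y_predicted are hypothetical data, not taken from the question) in which class 1 is never predicted, so crosstab on its own returns a single column.

```python
import pandas as pd

# Hypothetical labels: class 1 is never predicted.
y_actual = pd.Series([0, 1, 0, 1], name="actual")
y_predicted = pd.Series([0, 0, 0, 0], name="predicted")

# crosstab only creates columns for labels that actually occur.
raw = pd.crosstab(y_actual, y_predicted)

# reindex forces a fixed set of row and column labels;
# positions with no observations are filled with 0.
fixed = raw.reindex(index=[0, 1], columns=[0, 1], fill_value=0)

print(raw)
print(fixed)
```

Note that reindex only adds zero-filled rows or columns for labels that never occur. It does not count NaN entries, so for the blank values in the question the NaN values would still need to be filled before calling crosstab.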
