
I am trying to make a frequency table based on a dataframe with pandas and Python. In fact it's exactly the same as a previous question of mine which used R.

Let's say that I have a dataframe in pandas that looks like this (in fact the dataframe is much larger, but for illustrative purposes I limited the rows):

node    |   precedingWord
-------------------------
A-bom       de
A-bom       die
A-bom       de
A-bom       een
A-bom       n
A-bom       de
acroniem    het
acroniem    t
acroniem    het
acroniem    n
acroniem    een
act         de
act         het
act         die
act         dat
act         t
act         n

I'd like to use these values to make a count of the precedingWords per node, but with subcategories. For instance: one column titled neuter, another non-neuter and a last one rest. neuter would contain all values for which precedingWord is one of t, het, dat. non-neuter would contain de and die, and rest would contain everything that doesn't belong to neuter or non-neuter. (It would be nice if this could be dynamic; in other words, that rest uses some sort of reversed selection of what is used for neuter and non-neuter, or simply subtracts the neuter and non-neuter counts from the number of rows with that node.)
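The "reversed" selection I have in mind might look something like this (untested sketch using isin on my sample data; the list names are just what I'd call the categories):

```python
import pandas as pd

df = pd.DataFrame({
    "node": ["A-bom"] * 6 + ["acroniem"] * 5 + ["act"] * 6,
    "precedingWord": ["de", "die", "de", "een", "n", "de",
                      "het", "t", "het", "n", "een",
                      "de", "het", "die", "dat", "t", "n"],
})

neuter = ["t", "het", "dat"]
non_neuter = ["de", "die"]

# rest = the negation of both explicit categories
is_neuter = df.precedingWord.isin(neuter)
is_non_neuter = df.precedingWord.isin(non_neuter)
is_rest = ~(is_neuter | is_non_neuter)
```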

Example output (in a new dataframe, let's say freqDf) would look like this:

node    |   neuter   | nonNeuter   | rest
-----------------------------------------
A-bom       0          4             2
acroniem    3          0             2
act         3          2             1

I found an answer to a similar question but the use case isn't exactly the same. It seems to me that in that question all variables are independent. However, in my case I obviously have multiple rows with the same node, which should all be collapsed into a single frequency row - as shown in the expected output above.

I thought something like this (untested):

def specificFreq(d):
    for uniqueWord in d['node']:
        return pd.Series({'node': uniqueWord,
            'neuter': sum((d['node'] == uniqueWord) & d['precedingWord'].isin(['t', 'het', 'dat'])),
            'nonNeuter': sum((d['node'] == uniqueWord) & d['precedingWord'].isin(['de', 'die'])),
            'rest': len(d) - neuter - nonNeuter})  # rows with this node, minus the neuter and nonNeuter values above

df.groupby('node').apply(specificFreq)

But I highly doubt this is the correct way to do something like this.

1 Answer


As proposed in the R solution, you can first label each precedingWord with its gender category and then perform the cross tabulation:

neuter = ["t", "het", "dat"]
non_neuter = ["de", "die"]

df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[~df.precedingWord.isin(neuter + non_neuter), "gender"] = "rest"
# neuter + non_neuter is the concatenation of both lists.

pd.crosstab(df.node, df.gender)
gender    neuter  non_neuter  rest
node                              
A-bom          0           4     2
acroniem       3           0     2
act            3           2     1

This one is better because if a word in neuter or non_neuter is not present in precedingWord, it won't raise a KeyError as the former solution does.
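To illustrate the difference (toy data of my own; "banana" stands in for a word that never occurs in precedingWord):

```python
import pandas as pd

df = pd.DataFrame({"node": ["act", "act"],
                   "precedingWord": ["de", "het"]})
neuter = ["het", "banana"]  # "banana" never occurs in precedingWord

# isin just returns False for "banana", so the labelling still works:
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"

# selecting the same list as columns of the crosstab raises a KeyError:
ct = pd.crosstab(df.node, df.precedingWord)
try:
    ct[neuter]
except KeyError as e:
    print("KeyError:", e)
```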


Former solution, less clean.

Given your dataframe, you can make a simple cross tabulation:

ct = pd.crosstab(df.node, df.precedingWord) 

which gives:

pW        dat  de  die  een  het  n  t
node                                  
A-bom       0   3    1    1    0  1  0
acroniem    0   0    0    1    2  1  1
act         1   1    1    0    1  1  1

Then, you just want to sum certain columns together:

neuter = ["t", "het", "dat"]
non_neuter = ["de","die"]
freqDf = pd.DataFrame()

freqDf["neuter"] = ct[neuter].sum(axis=1)
ct.drop(neuter, axis=1, inplace=True)

freqDf["non_neuter"] = ct[non_neuter].sum(axis=1)
ct.drop(non_neuter, axis=1, inplace=True)

freqDf["rest"] = ct.sum(axis=1)

Which gives you freqDf:

          neuter  non_neuter  rest
node                              
A-bom          0           4     2
acroniem       3           0     2
act            3           2     1
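A variant that neither mutates ct nor breaks on missing labels would be to use reindex with fill_value (sketch on the same data):

```python
import pandas as pd

df = pd.DataFrame({
    "node": ["A-bom"] * 6 + ["acroniem"] * 5 + ["act"] * 6,
    "precedingWord": ["de", "die", "de", "een", "n", "de",
                      "het", "t", "het", "n", "een",
                      "de", "het", "die", "dat", "t", "n"],
})
neuter = ["t", "het", "dat"]
non_neuter = ["de", "die"]

ct = pd.crosstab(df.node, df.precedingWord)

# reindex fills labels absent from ct with 0 instead of raising a KeyError
freqDf = pd.DataFrame({
    "neuter": ct.reindex(columns=neuter, fill_value=0).sum(axis=1),
    "non_neuter": ct.reindex(columns=non_neuter, fill_value=0).sum(axis=1),
})
freqDf["rest"] = ct.sum(axis=1) - freqDf.neuter - freqDf.non_neuter
```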

HTH


7 Comments

Thanks. I think that I understand what's going on: create a frequency table of ALL words. Then, from that table, make sums: first the neuter cases, then nonNeuter. After each operation, drop the relevant columns, then finally count all remaining columns. Correct? One question though: what does axis=1 do? Also, I quickly tried this in Python 3.4.3 and I got the error that freqDf isn't defined. Should I first create a new table named freqDf?
@BramVanroy, yes, that's exactly what it's doing. axis=1 chooses the dimension (either columns or rows, so 1 is for columns and 0 for rows) along which you'll do something.
@BramVanroy, I edited for freqDf, I forgot to copy/paste a line, but yes, you need to declare it.
This is great so far, however there is an issue. When using either solution, when an element in neuter isn't found, Python will throw an index error. As an example: change neuter to neuter = ["het", "banana"]. Though a fictitious example, it can cause problems down the road.
@BramVanroy, there is indeed a KeyError with the first solution, but not with the second.