3

I have a question related to the df['columnA'].value_counts() method and a previous post here: Count frequency of values in pandas DataFrame column

Take this example DataFrame:

fake_data = {'columnA': ['XAVY', 'XAVY', 'XAVY', 'XAVY', 'XAVY', 'AXYV', 'AXYV', 'AXYV', 'AXYV', 'AXYV', 'AXYV']}
df = pd.DataFrame(fake_data, columns = ['columnA'])
df

I am trying to determine the frequency of each letter (X,A,V,Y) at each position in the string in this column.

In this example, position 0 would be 54% A, 46% X, position 3 would be 46% Y, 54% V...and so on.

5
  • won't position 3 be 50% V and 50%Y. Just wondering if I got you question right Commented Aug 31, 2019 at 8:54
  • Yes! Will correct typo. Commented Aug 31, 2019 at 8:55
  • Its a suggestion and thus a comment - write a method def countFreq(df) which will count occurrence of X in series/column - record. Pass your df[columnA] as an input parameter to this method. Within the countFreq(df) create a DICT , with keys like , dictFreq['X'] and dictFreq['A'] , etc . Append count of letters to these dict keys as values. Commented Aug 31, 2019 at 9:01
  • 1
    can you post the desired output too, how exactly it should look like? Commented Aug 31, 2019 at 9:30
  • Replies below, especially from Akash Kumar, have resolved problem. From his answer, I would add that if you want the frequency at a particularly position: stats.iloc[0] Commented Aug 31, 2019 at 19:19

4 Answers 4

5

First convert the string to a list and then expand to columns. Then count values and calculate percentage of occurence for each letter.

(
    df.columnA.apply(list)
    .apply(pd.Series)
    .apply(pd.value_counts)
    .apply(lambda x: x.div(x.sum()))
    .fillna(0)
)

    0           1           2           3
A   0.545455    0.454545    0.000000    0.000000
V   0.000000    0.000000    0.454545    0.545455
X   0.454545    0.545455    0.000000    0.000000
Y   0.000000    0.000000    0.545455    0.454545
Sign up to request clarification or add additional context in comments.

Comments

2

Maybe this helps:

new_data = df.columnA.str.split('',n=4, expand=True).drop(0, axis=1)
stats = new_data.apply(pd.Series.value_counts)
stats = stats.apply(lambda x: (x/x.sum())*100).round(2).fillna(0)
print(stats)

Output

    1      2    3     4
A   54.54 45.45 0     0
V   0     0     45.45 54.54
X   45.45 54.54 0     0
Y   0     0     54.54 45.45

1 Comment

I believe there is a typo in the first line: new_data = df.columnA.str.split('',n=4, expand=True).drop(0, axis=1)
2

This expand the linked post:

ddf = df['columnA'].apply(lambda x : pd.Series(list(x)))
counts = ddf[c].value_counts() / ddf[c].value_counts().sum() for c in ddf.columns]

counts is a list with a series produced by value_counts() for each column. So to check position 0, you do counts[0]. And it will gives:

A    0.545455
X    0.454545
Name: 0, dtype: float64

Comments

0

Something to begin with

>>> df.columnA.str.split('', expand=True).apply(pd.Series.value_counts)
      0    1    2    3    4     5
   11.0  NaN  NaN  NaN  NaN  11.0
A   NaN  6.0  5.0  NaN  NaN   NaN
V   NaN  NaN  NaN  5.0  6.0   NaN
X   NaN  5.0  6.0  NaN  NaN   NaN
Y   NaN  NaN  NaN  6.0  5.0   NaN

This way columns at 0 and 5th positions are empty, and can be ignored.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.