Count the frequency of characters at a position in a string in a Pandas DataFrame column

Question

I have a question related to the df['columnA'].value_counts() method and a previous post here: Count frequency of values in pandas DataFrame column

Take this example DataFrame:

fake_data = {'columnA': ['XAVY', 'XAVY', 'XAVY', 'XAVY', 'XAVY', 'AXYV', 'AXYV', 'AXYV', 'AXYV', 'AXYV', 'AXYV']}
df = pd.DataFrame(fake_data, columns = ['columnA'])
df

I am trying to determine the frequency of each letter (X,A,V,Y) at each position in the string in this column.

In this example, position 0 would be 54% A, 46% X, position 3 would be 46% Y, 54% V...and so on.

won't position 3 be 50% V and 50%Y. Just wondering if I got you question right — Akash Kumar
– Akash Kumar, Commented Aug 31, 2019 at 8:54
Its a suggestion and thus a comment - write a method def countFreq(df) which will count occurrence of X in series/column - record. Pass your df[columnA] as an input parameter to this method. Within the countFreq(df) create a DICT , with keys like , dictFreq['X'] and dictFreq['A'] , etc . Append count of letters to these dict keys as values. — Rohit Dhankar
– Rohit Dhankar, Commented Aug 31, 2019 at 9:01
can you post the desired output too, how exactly it should look like? — anky
– anky, Commented Aug 31, 2019 at 9:30
Replies below, especially from Akash Kumar, have resolved problem. From his answer, I would add that if you want the frequency at a particularly position: stats.iloc[0] — Cactus Philosopher
– Cactus Philosopher, Commented Aug 31, 2019 at 19:19

Allen Qin · Accepted Answer · 2019-08-31 09:58:29Z

5

First convert the string to a list and then expand to columns. Then count values and calculate percentage of occurence for each letter.

(
    df.columnA.apply(list)
    .apply(pd.Series)
    .apply(pd.value_counts)
    .apply(lambda x: x.div(x.sum()))
    .fillna(0)
)

    0           1           2           3
A   0.545455    0.454545    0.000000    0.000000
V   0.000000    0.000000    0.454545    0.545455
X   0.454545    0.545455    0.000000    0.000000
Y   0.000000    0.000000    0.545455    0.454545

answered Aug 31, 2019 at 9:58

Allen Qin

20k9 gold badges55 silver badges68 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Akash Kumar · Accepted Answer · 2019-09-01 12:37:52Z

2

Maybe this helps:

new_data = df.columnA.str.split('',n=4, expand=True).drop(0, axis=1)
stats = new_data.apply(pd.Series.value_counts)
stats = stats.apply(lambda x: (x/x.sum())*100).round(2).fillna(0)
print(stats)

Output

    1      2    3     4
A   54.54 45.45 0     0
V   0     0     45.45 54.54
X   45.45 54.54 0     0
Y   0     0     54.54 45.45

edited Sep 1, 2019 at 12:37

answered Aug 31, 2019 at 9:01

Akash Kumar

1,4061 gold badge13 silver badges30 bronze badges

1 Comment

Cactus Philosopher Over a year ago

I believe there is a typo in the first line: new_data = df.columnA.str.split('',n=4, expand=True).drop(0, axis=1)

Valentino · Accepted Answer · 2019-08-31 09:15:47Z

2

This expand the linked post:

ddf = df['columnA'].apply(lambda x : pd.Series(list(x)))
counts = ddf[c].value_counts() / ddf[c].value_counts().sum() for c in ddf.columns]

counts is a list with a series produced by value_counts() for each column. So to check position 0, you do counts[0]. And it will gives:

A    0.545455
X    0.454545
Name: 0, dtype: float64

answered Aug 31, 2019 at 9:15

Valentino

7,3716 gold badges22 silver badges36 bronze badges

Comments

crayxt · Accepted Answer · 2019-08-31 08:57:06Z

0

Something to begin with

>>> df.columnA.str.split('', expand=True).apply(pd.Series.value_counts)
      0    1    2    3    4     5
   11.0  NaN  NaN  NaN  NaN  11.0
A   NaN  6.0  5.0  NaN  NaN   NaN
V   NaN  NaN  NaN  5.0  6.0   NaN
X   NaN  5.0  6.0  NaN  NaN   NaN
Y   NaN  NaN  NaN  6.0  5.0   NaN

This way columns at 0 and 5th positions are empty, and can be ignored.

answered Aug 31, 2019 at 8:57

crayxt

2,4052 gold badges15 silver badges19 bronze badges

Collectives™ on Stack Overflow

Count the frequency of characters at a position in a string in a Pandas DataFrame column

4 Answers 4

Comments

1 Comment

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Comments

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related