How to create a dataframe column from values with frequency count?

Question

Given a problem set, with values and their associated frequencies, how can the sample be created in a dataframe?

Find the mean of this dataset
Value: 1 | 2 | 3
Freq:  3 | 4 | 2

This represents the sample [1, 1, 1, 2, 2, 2, 2, 3, 3].

I input this into Python:

>>> import pandas as pd
>>> df = pd.DataFrame({'value':[1, 2, 3], 'freq':[3, 4, 2]})
>>> df
   value  freq
0      1     3
1      2     4
2      3     2

It's not difficult to compute basic statistics in this format. For example, the mean of this dataset is (df['value'] * df['freq']).sum() / df['freq'].sum(). However, it would be nice to use built-in methods such as .mean(). To do that, I need to enter the value/freq data into the data frame as raw values. My end goal is a data frame that holds the raw sample rather than the value/frequency table.

Does anybody know how to input datasets given in value/frequency form and create a data frame of raw data? Thank you.
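As a quick sanity check on the formula in the question, here is a minimal sketch comparing the hand-written weighted mean with np.average and with the mean of the sample written out by hand. The variable names (manual_mean, weighted_mean, raw_sample, raw_mean) are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3], 'freq': [3, 4, 2]})

# Weighted mean computed by hand, as in the question.
manual_mean = (df['value'] * df['freq']).sum() / df['freq'].sum()

# np.average accepts the frequencies as weights and should agree.
weighted_mean = np.average(df['value'], weights=df['freq'])

# Mean of the expanded sample, typed out explicitly for comparison.
raw_sample = [1, 1, 1, 2, 2, 2, 2, 3, 3]
raw_mean = sum(raw_sample) / len(raw_sample)

print(manual_mean, weighted_mean, raw_mean)
```

All three numbers should equal 17/9, which is what any approach that expands the table into raw data needs to reproduce.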

Trenton McKinney · Accepted Answer · 2020-10-01 21:10:33Z

5

One option is to use np.repeat:

import numpy as np
import pandas as pd

values = [1, 2, 3]
frequency = [3, 4, 2]

# Repeat each value by its frequency to rebuild the raw sample.
df = pd.DataFrame(np.repeat(values, frequency), columns=['data'])

df.mean()
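The same idea can start from the value/freq data frame in the question instead of two separate lists. This is a sketch under that assumption; the names freq_table, raw, sample_mean, sample_median and sample_size are made up for illustration. Note that calling .mean() on a whole DataFrame returns a Series with one entry per column, while calling it on a single column returns a scalar.

```python
import numpy as np
import pandas as pd

freq_table = pd.DataFrame({'value': [1, 2, 3], 'freq': [3, 4, 2]})

# Repeat each value according to its frequency to rebuild the raw sample.
raw = pd.DataFrame(
    np.repeat(freq_table['value'].to_numpy(), freq_table['freq'].to_numpy()),
    columns=['data'],
)

# With the raw data in place, the built-in statistics work directly.
sample_mean = raw['data'].mean()
sample_median = raw['data'].median()
sample_size = len(raw)
```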

edited Oct 1, 2020 at 21:10

Trenton McKinney

answered Oct 1, 2020 at 21:00

O Pardal

Guimoute · Accepted Answer · 2020-10-01 21:03:15Z

3

You could use tuple or list multiplication:

import numpy as np

# Duplicate each value `freq` times.
values = [(value,) * freq for (value, freq) in zip(df["value"], df["freq"])]
# With freq = [3, 4, 2]: [(1, 1, 1), (2, 2, 2, 2), (3, 3)]

# Flatten the list of tuples into a list of values.
values = [item for sublist in values for item in sublist]
# With freq = [3, 4, 2]: [1, 1, 1, 2, 2, 2, 2, 3, 3]

mean = np.mean(values)

Interestingly, I cannot do df["data"] = values for some reason. It causes this error:

ValueError: Length of values does not match length of index
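A likely explanation for the error: the flattened list has one entry per observation (9 here), while df still has one row per distinct value (3), and a new column must have exactly one entry per existing row. A minimal sketch of the failure and a workaround, assuming the frequencies [3, 4, 2] from the problem statement; the names assignment_failed and raw are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3], 'freq': [3, 4, 2]})

# Expand each value `freq` times and flatten in a single comprehension.
values = [v for v, f in zip(df['value'], df['freq']) for _ in range(f)]

# Assigning 9 values to a 3-row frame fails with a ValueError.
try:
    df['data'] = values
    assignment_failed = False
except ValueError:
    assignment_failed = True

# Building a separate frame avoids the length mismatch.
raw = pd.DataFrame({'data': values})
```

Putting the expanded values in their own data frame sidesteps the mismatch, since the new frame takes its length from the list itself.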

edited Oct 1, 2020 at 21:03

answered Oct 1, 2020 at 20:57

Guimoute

Alexandra Dudkina · Accepted Answer · 2020-10-01 20:59:32Z

1

Alternative solution:

df = df.loc[df.index.repeat(df['freq'])]
df['value'].mean()

answered Oct 1, 2020 at 20:59

Alexandra Dudkina

1 Comment

Farzad Saif Over a year ago

This is a good solution! I would also recommend adding the line df = df.drop(columns=['freq']) to remove the leftover frequency column. You can also rename 'value' to 'data' with df = df.rename(columns={'value': 'data'}). This signals that the data frame holds the raw data rather than a list of values that is missing its frequency column.
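Putting the answer and the clean-up from the comment together might look like the sketch below. The reset_index(drop=True) step is an addition not mentioned in the thread: index.repeat leaves duplicated index labels (0, 0, 0, 1, ...), and resetting gives a plain 0..n-1 index. The names expanded and raw are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3], 'freq': [3, 4, 2]})

# Repeat each row label `freq` times, then select those rows.
expanded = df.loc[df.index.repeat(df['freq'])]

# Drop the now-redundant frequency column, rename the values column,
# and replace the duplicated index labels with a fresh 0..n-1 index.
raw = (
    expanded
    .drop(columns=['freq'])
    .rename(columns={'value': 'data'})
    .reset_index(drop=True)
)

print(raw['data'].mean())
```

Note that rename needs the columns= keyword here; passing the mapping positionally would target index labels instead of column names.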
