Given a problem set, with values and their associated frequencies, how can the sample be created in a dataframe?

Find the mean of this dataset
Value: 1 | 2 | 3
Freq:  3 | 4 | 2

This represents the sample [1, 1, 1, 2, 2, 2, 2, 3, 3].

I input this into Python:

>>> import pandas as pd
>>> df = pd.DataFrame({'value':[1, 2, 3], 'freq':[3, 4, 2]})
>>> df
   value  freq
0      1     3
1      2     4
2      3     2

It's not difficult to compute basic statistics in this format. For example, the mean of this dataset is (df['value'] * df['freq']).sum() / df['freq'].sum(). However, it would be nice to use built-in methods such as .mean(). To do this, I need to enter the value/freq data into the data frame as raw values. My end goal is this:

    data
0      1
1      1
2      1
3      2
4      2
5      2
6      2
7      3
8      3

Does anybody know how to input datasets given in value/frequency form and create a data frame of raw data? Thank you.

3 Answers

One option is to use np.repeat:

import numpy as np
import pandas as pd

values = [1,2,3]

frequency = [3,4,2]

df = pd.DataFrame(np.repeat(values, frequency), columns=['data'])

df.mean()
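For reference, here is a minimal self-contained sketch of the same idea. It also checks that the mean of the expanded sample agrees with the weighted mean computed directly from the value/frequency table. The names `expanded` and `weighted_mean` are only illustrative and are not part of the original answer.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]
frequency = [3, 4, 2]

# np.repeat repeats each value by its matching frequency.
expanded = pd.DataFrame(np.repeat(values, frequency), columns=['data'])

# Weighted mean taken straight from the table, for comparison.
weighted_mean = np.average(values, weights=frequency)

print(expanded['data'].tolist())
print(np.isclose(expanded['data'].mean(), weighted_mean))
```

Once the data is in this raw form, the other built-in methods such as .median() or .std() should work on the 'data' column in the same way.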

You could use tuple or list multiplication:

# Duplicate each value `freq` times.
values = [(value,) * freq for (value, freq) in zip(df["value"], df["freq"])]
# [(1, 1, 1), (2, 2, 2, 2), (3, 3)]

# Flatten the list of tuples into a list of values.
values = [item for sublist in values for item in sublist]
# [1, 1, 1, 2, 2, 2, 2, 3, 3]

mean = np.mean(values)

Interestingly, I cannot do df["data"] = values for some reason. It causes this error:

ValueError: Length of values does not match length of index
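The likely cause, as far as I can tell, is that the flattened list has one entry per observation while df still has one row per distinct value, so the two lengths cannot line up. Below is a small sketch of one way around it, assuming the question's table with frequencies 3, 4, 2 (9 observations against 3 rows). The name `raw` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3], 'freq': [3, 4, 2]})

# One list entry per observation: each value repeated freq times.
values = [v for v, f in zip(df['value'], df['freq']) for _ in range(f)]

# df has 3 rows but values has 9 items, so assigning it as a column fails.
try:
    df['data'] = values
except ValueError as err:
    print(err)

# Put the raw sample in its own DataFrame instead.
raw = pd.DataFrame({'data': values})
print(raw['data'].mean())
```

Building a separate frame keeps the original value/freq table intact, which may be handy if both views of the data are needed.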

Alternative solution:

df = df.loc[df.index.repeat(df['freq'])]
df['value'].mean()

1 Comment

This is a good solution! I would also recommend adding the line df = df.drop(columns=['freq']) to remove the frequency column left behind. You can also rename 'value' to 'data' with df = df.rename(columns={'value': 'data'}). This helps signal that the data frame holds the raw data, rather than a list of values that is missing its frequency column.
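Putting the repeat, drop and rename steps together might look like the sketch below. The reset_index call is an extra step not mentioned above: without it the repeated rows keep their original labels (0, 0, 0, 1, ...). The name `raw` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3], 'freq': [3, 4, 2]})

raw = (
    df.loc[df.index.repeat(df['freq'])]   # repeat each row freq times
      .drop(columns=['freq'])             # frequency column is no longer needed
      .rename(columns={'value': 'data'})  # columns= is needed to rename a column
      .reset_index(drop=True)             # relabel the rows 0..8
)

print(raw)
print(raw['data'].mean())
```

Note that rename called with a plain dict and no columns= keyword targets the index labels rather than the column names, so the keyword matters here.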
