I am a noob and I have a large CSV file with data structured like this (with a lot more columns):

State  daydiff
CT     5.5
CT     6.5
CT     6.25
NY     3.2
NY     3.225
PA     7.522
PA     4.25

I want to output a new CSV where the daydiff is averaged for each State like this:

State  daydiff
CT     6.083
NY     3.2125
PA     5.886

I have tried numerous approaches, and the cleanest seemed to be pandas groupby, but when I run the code below:

import pandas as pd

df = pd.read_csv('C:...input.csv')
df.groupby('State')['daydiff'].mean()

df.to_csv('C:...AverageOutput.csv')

I get a file that is identical to the original file but with a counter added in the first column with no header:

,State,daydiff
0,CT,5.5
1,CT,6.5
2,CT,6.25
3,NY,3.2
4,NY,3.225
5,PA,7.522
6,PA,4.25

I was also hoping to round the new daydiff average to two decimal places (hundredths). Thanks

  • Use df.to_csv('C:...AverageOutput.csv', index=False) Commented Oct 10, 2017 at 14:31
  • @Zero Thanks. That solved the extra column but now the output is identical to the input with no averaging happening. Commented Oct 10, 2017 at 14:37

1 Answer

The "problem" with the counter is that the default behaviour of to_csv is to write the index. You should do df.to_csv('C:...AverageOutput.csv', index=False).
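To see the index behaviour in isolation, here is a minimal sketch. It assumes a tiny stand-in DataFrame built from a few of the sample rows rather than the real file, and it relies on to_csv returning the CSV text when no path is given.

```python
import pandas as pd

# Stand-in for the question's data (a few of the sample rows, not the real file)
df = pd.DataFrame({"State": ["CT", "CT", "NY"], "daydiff": [5.5, 6.5, 3.2]})

# With no path given, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header begins with an empty index column
print(without_index.splitlines()[0])  # header holds only the column names
```

The unnamed first column in the question's output file is this row index.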

You can control the output format of daydiff by converting it to a string: df.daydiff = df.daydiff.apply(lambda x: '{:.2f}'.format(x))
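One thing to keep in mind is that formatting this way turns the values into strings. If you would rather keep numbers, Series.round is an alternative. The sketch below compares the two, assuming a hypothetical Series named means that holds approximately the averages from the question.

```python
import pandas as pd

# Hypothetical per-State means, roughly the values expected in the question
means = pd.Series([6.083333, 3.2125, 5.886], index=["CT", "NY", "PA"], name="daydiff")

# String formatting: each value becomes text with exactly two decimals
as_text = means.apply(lambda x: '{:.2f}'.format(x))

# Numeric rounding: values stay floats, so later arithmetic still works
as_numbers = means.round(2)

print(as_text)
print(as_numbers)
```

Either is fine if the only goal is the CSV file; rounding is the safer choice when the values will be used in further calculations.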

Your complete code should be:

import pandas as pd

df = pd.read_csv('C:...input.csv')
# groupby does not modify df; keep the averaged result in a new variable
df2 = df.groupby('State')['daydiff'].mean().apply(lambda x: '{:.2f}'.format(x))
df2.to_csv('C:...AverageOutput.csv')
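If you want the result to stay a DataFrame with both column headers, one possible variant (a sketch, not the only way) is to pass as_index=False to groupby and let to_csv handle the two-decimal formatting through float_format. The example reads the sample rows from the question out of an in-memory string, which is an assumption made so it can run without the real file; the name averages is just illustrative.

```python
import io
import pandas as pd

# Stand-in for the input file, using the sample rows from the question
csv_text = """State,daydiff
CT,5.5
CT,6.5
CT,6.25
NY,3.2
NY,3.225
PA,7.522
PA,4.25
"""
df = pd.read_csv(io.StringIO(csv_text))

# groupby returns a new object and leaves df unchanged.
# as_index=False keeps State as a regular column, so the result is a DataFrame.
averages = df.groupby("State", as_index=False)["daydiff"].mean()

# float_format writes two decimals without converting the column to strings;
# index=False drops the row counter.
result = averages.to_csv(index=False, float_format="%.2f")
print(result)
```

To write a file instead of printing, pass the output path as the first argument to to_csv.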

3 Comments

Thanks. That solved the extra column but now the output is identical to the input with no averaging happening.
Take care which variable you're calling to_csv on. If you call df.to_csv, you are writing the original data to the new file. You have to make sure you're using df2.to_csv. I've updated my answer since the groupby will output a Series, not a DataFrame. Since it's a Series, you need the index.
That worked great thanks, but it did not carry over the headers as I expected.
