Python pandas error while reading and writing csv file

Question

I'm not python guy but I have to write something like this occasionally. I wrote this code few months back and it was serving the purpose without any errors. But today when I need to use this same script for some updated csv files. It is giving me some errors which I can't fix by my own. Please have a look at the code below following by the errors.

import pandas as pd
#import xlsxwriter

data_df = pd.read_excel("New2020Snap.xlsx")

data_df['MaxDate'] = data_df.groupby(['LeadId', 'LeadStatus'])['CreatedDate'].transform('max')
data_df['MinDate'] = data_df.groupby(['LeadId', 'LeadStatus'])['CreatedDate'].transform('min')
data_df['Difference'] = pd.to_datetime(data_df['MaxDate']) - pd.to_datetime(data_df['MinDate'])

agg_df = data_df.groupby(['LeadId', 'LeadStatus', 'Email']).agg(MaxDate=('CreatedDate', 'max'),
                                                                MinDate=('CreatedDate', 'min')).reset_index()
agg_df['Difference'] = pd.to_datetime(agg_df['MaxDate']) - pd.to_datetime(agg_df['MinDate'])

#data_df.to_json(orient='records')

with pd.ExcelWriter('../out/ComputedReport.xlsx', engine='XlsxWriter') as writer:
    data_df.to_excel(writer, sheet_name='New Computed Data', index=False)
    agg_df.to_excel(writer, sheet_name='Computed Agg Data', index=False)

print(data_df)

Following are the errors I'm receving from running this above script.

Traceback (most recent call last):
  File "C:\Users\w-s\IdeaProjects\PythonForEverybody\src\pandas_read_opps.py", line 6, in <module>
    data_df['MaxDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('max')
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\generic.py", line 511, in transform
    result = getattr(self, func)(*args, **kwargs)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1559, in max
    return self._agg_general(
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1017, in _agg_general
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\generic.py", line 255, in aggregate
    return self._python_agg_general(
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1094, in _python_agg_general
    return self._python_apply_general(f, self._selected_obj)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 892, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\ops.py", line 213, in apply
    res = f(group)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1062, in <lambda>
    f = lambda x: func(x, *args, **kwargs)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1017, in <lambda>
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "<__array_function__ internals>", line 5, in amax
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\fromnumeric.py", line 2705, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\fromnumeric.py", line 85, in _wrapreduction
    return reduction(axis=axis, out=out, **passkwargs)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 11468, in stat_func
    return self._reduce(
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 4248, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\nanops.py", line 129, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\nanops.py", line 873, in reduction
    result = getattr(values, meth)(axis)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
TypeError: '>=' not supported between instances of 'datetime.datetime' and 'str'

Process finished with exit code 1

So basically I was working on two separate copies of the same code with slight changes in each of them. However the one which I was able to fix is pasted below. And I made the changes as suggested by the first comment under my question, by Sir Guillaume Ansanay-Alex. And the exact line of code was suggested by the answer which I'm going to mark as correct after this edit. So the following errors was there in the code.

One of my csv column contains improperlyformatted date/time data.
The Keyword engine=XlsxWriter wasn't in the right case which was working fine few months back. It is now working as engine=xlsxwriter, in all small case.

So the working copy for my code is as below.

import pandas as pd
#import xlsxwriter

data_df = pd.read_excel("OppAvgStageDuration.xlsx")
#suggested by the first comment and answered by the accepted one.
data_df['CloseDate'] = pd.to_datetime(data_df['CloseDate']) 

data_df['MaxDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('max')
data_df['MinDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('min')
data_df['Difference'] = pd.to_datetime(data_df['MaxDate']) - pd.to_datetime(data_df['MinDate'])

agg_df = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage']).agg(MaxDate=('CloseDate', 'max'),
                                                                              MinDate=('CloseDate', 'min')).reset_index()
agg_df['Difference'] = pd.to_datetime(agg_df['MaxDate']) - pd.to_datetime(agg_df['MinDate'])

#data_df.to_json(orient='records')

with pd.ExcelWriter('../out/ComputedReportOpps.xlsx', engine='xlsxwriter') as writer:
    data_df.to_excel(writer, sheet_name='New Computed Data', index=False)
    agg_df.to_excel(writer, sheet_name='Computed Agg Data', index=False)

print(data_df)

Looks like the CloseDate column in your XLSX file has heterogeneous data, namely some cells can't be interpreted as dates. — Guillaume Ansanay-Alex
– Guillaume Ansanay-Alex, Commented Apr 8, 2021 at 14:54
@GuillaumeAnsanay-Alex thanks for the suggestion. One thing I'm confused about it is, how do you know the column name is CloseDate? As I used the CreatedDate in the code. However the correct file on which I'm working is actually using CloseDate on it. And this is the actual error. Which is now fixed many thanks. — Superman
– Superman, Commented Apr 8, 2021 at 15:25

tdy · Accepted Answer · 2021-04-08 15:01:00Z

1

Currently you're converting the derived MaxDate and MinDate columns to_datetime(), but try converting the source CreatedDate column to_datetime() from the start:

data_df = pd.read_excel("New2020Snap.xlsx")
data_df['CreatedDate'] = pd.to_datetime(data_df['CreatedDate'])

If that doesn't work then I think it's what Guillaume commented, that it has mixed format.

answered Apr 8, 2021 at 15:01

tdy

42k42 gold badges124 silver badges125 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Python pandas error while reading and writing csv file

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related