I'm not python guy but I have to write something like this occasionally. I wrote this code few months back and it was serving the purpose without any errors. But today when I need to use this same script for some updated csv files. It is giving me some errors which I can't fix by my own. Please have a look at the code below following by the errors.
import pandas as pd
#import xlsxwriter
data_df = pd.read_excel("New2020Snap.xlsx")
data_df['MaxDate'] = data_df.groupby(['LeadId', 'LeadStatus'])['CreatedDate'].transform('max')
data_df['MinDate'] = data_df.groupby(['LeadId', 'LeadStatus'])['CreatedDate'].transform('min')
data_df['Difference'] = pd.to_datetime(data_df['MaxDate']) - pd.to_datetime(data_df['MinDate'])
agg_df = data_df.groupby(['LeadId', 'LeadStatus', 'Email']).agg(MaxDate=('CreatedDate', 'max'),
MinDate=('CreatedDate', 'min')).reset_index()
agg_df['Difference'] = pd.to_datetime(agg_df['MaxDate']) - pd.to_datetime(agg_df['MinDate'])
#data_df.to_json(orient='records')
with pd.ExcelWriter('../out/ComputedReport.xlsx', engine='XlsxWriter') as writer:
data_df.to_excel(writer, sheet_name='New Computed Data', index=False)
agg_df.to_excel(writer, sheet_name='Computed Agg Data', index=False)
print(data_df)
Following are the errors I'm receving from running this above script.
Traceback (most recent call last):
File "C:\Users\w-s\IdeaProjects\PythonForEverybody\src\pandas_read_opps.py", line 6, in <module>
data_df['MaxDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('max')
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\generic.py", line 511, in transform
result = getattr(self, func)(*args, **kwargs)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1559, in max
return self._agg_general(
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1017, in _agg_general
result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\generic.py", line 255, in aggregate
return self._python_agg_general(
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1094, in _python_agg_general
return self._python_apply_general(f, self._selected_obj)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 892, in _python_apply_general
keys, values, mutated = self.grouper.apply(f, data, self.axis)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\ops.py", line 213, in apply
res = f(group)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1062, in <lambda>
f = lambda x: func(x, *args, **kwargs)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1017, in <lambda>
result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
File "<__array_function__ internals>", line 5, in amax
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\fromnumeric.py", line 2705, in amax
return _wrapreduction(a, np.maximum, 'max', axis, None, out,
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\fromnumeric.py", line 85, in _wrapreduction
return reduction(axis=axis, out=out, **passkwargs)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 11468, in stat_func
return self._reduce(
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 4248, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\nanops.py", line 129, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\nanops.py", line 873, in reduction
result = getattr(values, meth)(axis)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\_methods.py", line 39, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial, where)
TypeError: '>=' not supported between instances of 'datetime.datetime' and 'str'
Process finished with exit code 1
So basically I was working on two separate copies of the same code with slight changes in each of them. However the one which I was able to fix is pasted below. And I made the changes as suggested by the first comment under my question, by Sir Guillaume Ansanay-Alex. And the exact line of code was suggested by the answer which I'm going to mark as correct after this edit. So the following errors was there in the code.
- One of my csv column contains improperlyformatted date/time data.
- The Keyword engine=XlsxWriter wasn't in the right case which was working fine few months back. It is now working as engine=xlsxwriter, in all small case.
So the working copy for my code is as below.
import pandas as pd
#import xlsxwriter
data_df = pd.read_excel("OppAvgStageDuration.xlsx")
#suggested by the first comment and answered by the accepted one.
data_df['CloseDate'] = pd.to_datetime(data_df['CloseDate'])
data_df['MaxDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('max')
data_df['MinDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('min')
data_df['Difference'] = pd.to_datetime(data_df['MaxDate']) - pd.to_datetime(data_df['MinDate'])
agg_df = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage']).agg(MaxDate=('CloseDate', 'max'),
MinDate=('CloseDate', 'min')).reset_index()
agg_df['Difference'] = pd.to_datetime(agg_df['MaxDate']) - pd.to_datetime(agg_df['MinDate'])
#data_df.to_json(orient='records')
with pd.ExcelWriter('../out/ComputedReportOpps.xlsx', engine='xlsxwriter') as writer:
data_df.to_excel(writer, sheet_name='New Computed Data', index=False)
agg_df.to_excel(writer, sheet_name='Computed Agg Data', index=False)
print(data_df)
CloseDatecolumn in your XLSX file has heterogeneous data, namely some cells can't be interpreted as dates.