I have a DataFrame with a datetime index that contains hourly data for 2019 and 2020. I import it from a CSV file as follows, keeping only the columns I need and giving them shorter names (the names are changed for work reasons):
import pandas as pd

file = 'data.csv'
df = pd.read_csv(file, sep=';', header=0, na_values=['NA', ' ', '.'])
df['datetime'] = pd.to_datetime(df['datetime'])
# calendar helper columns derived from the parsed timestamps
df['week'] = df['datetime'].dt.isocalendar().week
df['month'] = df['datetime'].dt.month
df['hour'] = df['datetime'].dt.hour
df['day'] = df['datetime'].dt.day
df = df.set_index('datetime')
# rename data1..data16 to d1..d16
data_cols = [f'd{i}' for i in range(1, 17)]
df = df.rename(columns={f'data{i}': f'd{i}' for i in range(1, 17)})
# keep only the renamed data columns and the calendar helpers
df = df[data_cols + ['week', 'month', 'hour', 'day']]
When I type:
df['d4'][0:2800].min()
The answer is 995, which I know is correct because I checked it against the CSV file.
My problem is that, during import, some dates end up in the DataFrame in the wrong order. I don't know why, but for example 2019-09-09 is followed by 2019-09-13 instead of 2019-09-10.
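One way to confirm where the order breaks is to check whether the index is monotonic and to list the rows whose timestamp is earlier than the previous one. This is only a sketch on a tiny made-up frame (the dates and `d4` values below are placeholders, not the real data):

```python
import pandas as pd

# Made-up index in "file order", with one timestamp going backwards.
idx = pd.to_datetime(["2019-09-09", "2019-09-13", "2019-09-10"])
df = pd.DataFrame({"d4": [1, 2, 3]}, index=idx)

# False means at least one row is out of chronological order.
is_sorted = df.index.is_monotonic_increasing

# Difference between each timestamp and the previous one;
# a negative step marks a row that jumps back in time.
steps = df.index.to_series().diff()
backwards = df[(steps < pd.Timedelta(0)).to_numpy()]

print(is_sorted)
print(backwards)
```

Running the same check on the real `df` right after `set_index` would show which rows are affected, which may hint at how the dates were parsed.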
I tried to fix it by using
df=df.sort_index(ascending=True)
or
df=df.sort_index()
and it seems to work, since all the dates are now in the correct order. But when I type
df['d4'][0:2800].min()
the answer is now 870, which is wrong.
It seems like df.sort_index() is mixing up my data. Am I doing something wrong?
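A possible explanation, sketched under assumptions: `sort_index()` reorders whole rows and does not mix values between them, but `[0:2800]` is a positional slice, so after sorting it covers a different set of rows than before. The toy frame below uses made-up dates, with the values 995 and 870 borrowed from the question purely for illustration:

```python
import pandas as pd

# Made-up rows in "file order"; the second timestamp is out of place.
idx = pd.to_datetime(["2019-09-09", "2019-10-09", "2019-09-13", "2019-09-14"])
df = pd.DataFrame({"d4": [995, 1200, 870, 1100]}, index=idx)

# Positional slice on the unsorted frame: rows 09-09 and 10-09.
before = df["d4"].iloc[0:2].min()

sorted_df = df.sort_index()

# Same positional slice, but now it covers rows 09-09 and 09-13.
after = sorted_df["d4"].iloc[0:2].min()

# Each timestamp still carries its own value after sorting.
same_row = sorted_df.at[pd.Timestamp("2019-09-13"), "d4"]

# A label-based slice selects by date, regardless of row position.
by_label = sorted_df.loc["2019-09-09":"2019-09-13", "d4"].min()

print(before, after, same_row, by_label)
```

If this is what is happening, the 870 is a real value in the data that simply was not among the first 2800 rows in file order. Once the index is sorted, selecting by date range with `.loc` is less fragile than selecting by position.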
What does df['datetime'] look like before you call pd.to_datetime(df['datetime'])?
Can you try df['datetime']=pd.to_datetime(df['datetime'], dayfirst=True) and see if that helps?
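To illustrate why `dayfirst=True` might matter here: if the CSV stores dates day-first (an assumption, since the raw format is not shown), strings whose day is 12 or less are ambiguous and are read month-first by default. The strings below are hypothetical:

```python
import pandas as pd

# Hypothetical day-first strings (9 to 12 September 2019); assumed format.
raw = pd.Series(["09/09/2019", "10/09/2019", "11/09/2019", "12/09/2019"])

# Default parsing treats the first number as the month.
month_first = pd.to_datetime(raw)

# dayfirst=True treats the first number as the day.
day_first = pd.to_datetime(raw, dayfirst=True)

print(month_first.dt.strftime("%Y-%m-%d").tolist())
print(day_first.dt.strftime("%Y-%m-%d").tolist())
```

With the default, the 10th, 11th and 12th of September land in October, November and December, which would be consistent with 2019-09-09 being followed by 2019-09-13 in the question. If the format is known exactly, passing an explicit `format=` string to `pd.to_datetime` removes the ambiguity altogether.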