
I am trying to join two dataframes on the same column, "Date". The code is as follows:

import pandas as pd
from datetime import datetime
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')

start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)

df_train_fly = pd.date_range(start, end, freq="W-FRI")
df_train_fly = pd.DataFrame(pd.Series(df_train_fly), columns=['Date'])

merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')

It complains that the dataframe df_train_csv has no column named "Date". I'd like to set "Date" as the index in both dataframes. What is the best way to join dataframes with a date as the index?

UPDATE:

Here is the sample data:

Date,Weekly_Sales
2010-02-05,24924.5
2010-02-12,46039.49
2010-02-19,41595.55
2010-02-26,19403.54
2010-03-05,21827.9
2010-03-12,21043.39
2010-03-19,22136.64
2010-03-26,26229.21
2010-04-02,57258.43
2010-04-09,42960.91
2010-04-16,17596.96
2010-04-23,16145.35
2010-04-30,16555.11
2010-05-07,17413.94
2010-05-14,18926.74
2010-05-21,14773.04
2010-05-28,15580.43
2010-06-04,17558.09
2010-06-11,16637.62
2010-06-18,16216.27
2010-06-25,16328.72
2010-07-02,16333.14
2010-07-09,17688.76
2010-07-16,17150.84
2010-07-23,15360.45
2010-07-30,15381.82
2010-08-06,17508.41
2010-08-13,15536.4
2010-08-20,15740.13
2010-08-27,15793.87
2010-09-03,16241.78
2010-09-10,18194.74
2010-09-17,19354.23
2010-09-24,18122.52
2010-10-01,20094.19
2010-10-08,23388.03
2010-10-15,26978.34
2010-10-22,25543.04
2010-10-29,38640.93
2010-11-05,34238.88
2010-11-12,19549.39
2010-11-19,19552.84
2010-11-26,18820.29
2010-12-03,22517.56
2010-12-10,31497.65
2010-12-17,44912.86
2010-12-24,55931.23
2010-12-31,19124.58
2011-01-07,15984.24
2011-01-14,17359.7
2011-01-21,17341.47
2011-01-28,18461.18
2011-02-04,21665.76
2011-02-11,37887.17
2011-02-18,46845.87
2011-02-25,19363.83
2011-03-04,20327.61
2011-03-11,21280.4
2011-03-18,20334.23
2011-03-25,20881.1
2011-04-01,20398.09
2011-04-08,23873.79
2011-04-15,28762.37
2011-04-22,50510.31
2011-04-29,41512.39
2011-05-06,20138.19
2011-05-13,17235.15
2011-05-20,15136.78
2011-05-27,15741.6
2011-06-03,16434.15
2011-06-10,15883.52
2011-06-17,14978.09
2011-06-24,15682.81
2011-07-01,15363.5
2011-07-08,16148.87
2011-07-15,15654.85
2011-07-22,15766.6
2011-07-29,15922.41
2011-08-05,15295.55
2011-08-12,14539.79
2011-08-19,14689.24
2011-08-26,14537.37
2011-09-02,15277.27
2011-09-09,17746.68
2011-09-16,18535.48
2011-09-23,17859.3
2011-09-30,18337.68
2011-10-07,20797.58
2011-10-14,23077.55
2011-10-21,23351.8
2011-10-28,31579.9
2011-11-04,39886.06
2011-11-11,18689.54
2011-11-18,19050.66
2011-11-25,20911.25
2011-12-02,25293.49
2011-12-09,33305.92
2011-12-16,45773.03
2011-12-23,46788.75
2011-12-30,23350.88
2012-01-06,16567.69
2012-01-13,16894.4
2012-01-20,18365.1
2012-01-27,18378.16
2012-02-03,23510.49
2012-02-10,36988.49
2012-02-17,54060.1
2012-02-24,20124.22
2012-03-02,20113.03
2012-03-09,21140.07
2012-03-16,22366.88
2012-03-23,22107.7
2012-03-30,28952.86
2012-04-06,57592.12
2012-04-13,34684.21
2012-04-20,16976.19
2012-04-27,16347.6
2012-05-04,17147.44
2012-05-11,18164.2
2012-05-18,18517.79
2012-05-25,16963.55
2012-06-01,16065.49
2012-06-08,17666
2012-06-15,17558.82
2012-06-22,16633.41
2012-06-29,15722.82
2012-07-06,17823.37
2012-07-13,16566.18
2012-07-20,16348.06
2012-07-27,15731.18
2012-08-03,16628.31
2012-08-10,16119.92
2012-08-17,17330.7
2012-08-24,16286.4
2012-08-31,16680.24
2012-09-07,18322.37
2012-09-14,19616.22
2012-09-21,19251.5
2012-09-28,18947.81
2012-10-05,21904.47
2012-10-12,22764.01
2012-10-19,24185.27
2012-10-26,27390.81

I read it from a csv file, but some weeks may be missing. Therefore, I generate a full date range like this:

df_train_fly = pd.date_range(start, end, freq="W-FRI")

The dataframe built from this range contains every week in the range, so I need to merge the two dataframes into one.

If I check df_train_csv['Date'] and df_train_fly['Date'] in the IPython console, both show dtype: datetime64[ns].

  • Well it doesn't work because you set the index_col to date when you read the csv, you can either not set the index_col to 'Date' OR set the 'Date' col in df_train_fly as the index column also and pass left_index=True, right_index=True to the join Commented May 21, 2014 at 15:09
  • Actually you cannot pass left_index=True etc.. unless you do a merge so if you want to use join then drop the index_col='Date' param in read_csv Commented May 21, 2014 at 15:10
  • I think in your case if you did this it should work: merged = df_train_csv.join(df_train_fly, how = 'right', lsuffix='_x') so drop the on=['Date'] param as this is for specifying a column to join on, if you leave this out the default is None which will use the index to join on Commented May 21, 2014 at 15:19
  • I think it should've worked, what version pandas, numpy and python are you using. Also are you able to post the full data? Commented May 21, 2014 at 17:04
  • I used the other option that you mentioned: not setting the 'on' param and let them join on index automatically, which is working! BTW, to set the index for a dataframe,like this df_train_fly.set_index('Date') does not really change the index and I have to assign the value back to df: df_train_fly = df_train_fly.set_index('Date') to make it work. Commented May 21, 2014 at 17:16

1 Answer

So let's dissect this:

df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')

The first problem is that you have specified 'Date' as the index column, which means the dataframe no longer has a 'Date' column.
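As a minimal sketch of that point, the snippet below reads three rows copied from the sample data in the question, once with index_col='Date' and once without. The names csv_text, with_index and without_index are made up for this illustration, and io.StringIO stands in for the real train.csv file.

```python
import io
import pandas as pd

# Three rows copied from the sample data in the question.
csv_text = """Date,Weekly_Sales
2010-02-05,24924.5
2010-02-12,46039.49
2010-02-19,41595.55
"""

# With index_col='Date', the dates become the index, not a column.
with_index = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

# Without index_col, 'Date' stays an ordinary column.
without_index = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(list(with_index.columns))     # only the sales column is left
print(with_index.index.name)        # the index carries the name 'Date'
print(list(without_index.columns))  # both 'Date' and the sales column
```

This is why join(..., on=['Date']) cannot find a 'Date' column on df_train_csv when it was read with index_col='Date'.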

start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)

df_train_fly = pd.date_range(start, end, freq="W-FRI")
df_train_fly = pd.DataFrame(pd.Series(df_train_fly), columns=['Date'])

merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')

So the above join fails with the error you reported. To fix this:

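# remove the index_col param so 'Date' stays a regular column
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'])
# 'on' matches the 'Date' column of df_train_csv against the index of the other frame,
# so df_train_fly still needs 'Date' as its index
merged = df_train_csv.join(df_train_fly.set_index('Date'), on = 'Date', how = 'right', lsuffix='_x')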

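OR keep index_col='Date' in read_csv, set 'Date' as the index of df_train_fly as well, and don't set the 'on' param:

merged = df_train_csv.join(df_train_fly.set_index('Date'), how = 'right', lsuffix='_x')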

The above joins on the index of both dataframes.
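Here is a small, self-contained sketch of the index-on-index join, assuming both frames end up with 'Date' as their index. It uses three rows from the question's sample data with the week of 2010-02-12 deliberately left out, and io.StringIO in place of train.csv; the shortened date range is also just for illustration.

```python
import io
import pandas as pd

# Sample rows from the question, with 2010-02-12 deliberately missing.
csv_text = """Date,Weekly_Sales
2010-02-05,24924.5
2010-02-19,41595.55
2010-02-26,19403.54
"""

df_train_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

# Every Friday in the (shortened) range.
df_train_fly = pd.DataFrame({'Date': pd.date_range('2010-02-05', '2010-02-26', freq='W-FRI')})
# set_index returns a new dataframe, so the result has to be assigned back.
df_train_fly = df_train_fly.set_index('Date')

# No 'on' argument: both indexes are used as the join keys.
merged = df_train_csv.join(df_train_fly, how='right')
print(merged)
```

The result has one row per Friday in the range, and the week that was missing from the csv shows up with NaN in Weekly_Sales.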

You can also achieve the same result by performing a merge instead:

# merge() has no lsuffix param (it takes suffixes=...); none is needed here
merged = df_train_csv.merge(df_train_fly.set_index(['Date']), left_index=True, right_index=True, how = 'right')
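If you would rather keep 'Date' as a regular column in both frames, a merge on that column is another option. This is a sketch under that assumption, not part of the code above: the csv is read without index_col, io.StringIO stands in for train.csv, and the rows and the shortened date range are only illustrative.

```python
import io
import pandas as pd

# Sample rows from the question, with 2010-02-12 deliberately missing.
csv_text = """Date,Weekly_Sales
2010-02-05,24924.5
2010-02-19,41595.55
2010-02-26,19403.54
"""

# No index_col here, so 'Date' is a normal column.
df_train_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

# Full list of Fridays, also kept as a 'Date' column.
df_train_fly = pd.DataFrame({'Date': pd.date_range('2010-02-05', '2010-02-26', freq='W-FRI')})

# A left merge from the full range keeps every week and fills gaps with NaN.
merged = df_train_fly.merge(df_train_csv, on='Date', how='left')
print(merged)
```

Because the key is named explicitly on both sides, neither frame's index takes part in the match.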

2 Comments

Hi EdCum, thanks for your help! I removed index_col='Date' and use merged = df_train_fly.join(df_train_csv, on = ['Date'], how = 'left', lsuffix='_x'). But it gives me very strange results (all columns are NaN and date from df_train_csv is NaT) and I guess the date in df_train_csv is in a different format from that of df_train_fly, because when I use inner join, an empty set will be returned. What is wrong with the timestamp?
@wuha can you edit your question with sample data and the output of the 'Date' column values?
