I would like to merge two CSV files as follows:

First CSV file:

import pandas as pd

df = pd.DataFrame()
df["ticket_number"] = ['AAA', 'AAA', 'AAA', 'ABC', 'ABA','ADC','ABA','BBB']
df["train_board_station"] = ['Tokyo', 'LA', 'Paris', 'New_York', 'Delhi','Phoenix', 'London','LA']
df["train_off_station"] = ['Phoenix', 'London', 'Sydney', 'Berlin', 'Shanghai','LA', 'Paris', 'New_York']

Second CSV file:

rec = pd.DataFrame()
rec["code"] = ['Tokyo','London','Paris','New_York','Shanghai','LA','Sydney','Berlin','Phoenix','Delhi']
rec["count_A"] = ['1.2','7.8','4','8','7.8','3','8','5','2','10']
rec["count_B"] = ['12','78','4','8','78','36','88','51','25','10']

I use the following code:

for x in ["board", "off"]:
    df["station"] = df["train_" + x + "_station"]
    df["code"] = df["train_" + x + "_station"]
    df = pd.concat([df,rec], axis=1, join_axes=[df.index])
    df[x + "_count_A"] = df["count_A"]
    df[x + "_count_B"] = df["count_B"]
    df = df.drop(["station", "code","count_A","count_B"], axis=1)

I get the following incorrect output:

ticket_number,train_board_station,train_off_station,board_count_A,board_count_B,off_count_A,off_count_B
AAA,Tokyo,Phoenix,1.2,12,1.2,12
AAA,LA,London,7.8,78,7.8,78
AAA,Paris,Sydney,4,4,4,4
ABC,New_York,Berlin,8,8,8,8
ABA,Delhi,Shanghai,7.8,78,7.8,78
ADC,Phoenix,LA,3,36,3,36
ABA,London,Paris,8,88,8,88
BBB,LA,New_York,5,51,5,51

I notice that count_A and count_B are not matched to the train_board_station and train_off_station of each row. Instead, the rows of rec are attached by position (the first row of rec to the first row of df, the second to the second, and so on), and the same values end up in both the board and the off columns.
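The behaviour described above can be reproduced on a tiny example. This is a minimal sketch, not part of the original post: the frames `left` and `right` and the variable `by_position` are made-up names for illustration. It shows that `pd.concat(..., axis=1)` pairs rows by index label and never looks at the station names.

```python
import pandas as pd

# Two tiny frames whose rows are in a different station order (illustrative data)
left = pd.DataFrame({"station": ["LA", "Tokyo"]})
right = pd.DataFrame({"code": ["Tokyo", "LA"], "count_A": ["1.2", "3"]})

# axis=1 concat pairs index 0 with index 0 and index 1 with index 1;
# the values in "station" and "code" play no role in the pairing
by_position = pd.concat([left, right], axis=1)

print(by_position)
```

Here the row for station LA ends up next to the counts for Tokyo, which is the same kind of mismatch as in the output above.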

The expected output is:

ticket_number,train_board_station,train_off_station,board_count_A,board_count_B,off_count_A,off_count_B
AAA,Tokyo,Phoenix,1.2,12,2,25
AAA,LA,London,3,36,7.8,78
AAA,Paris,Sydney,4,4,8,88
ABC,New_York,Berlin,8,8,5,51
ABA,Delhi,Shanghai,10,10,7.8,78
ADC,Phoenix,LA,2,25,3,36
ABA,London,Paris,7.8,78,4,4
BBB,LA,New_York,3,36,8,8

  • Can you paste the expected output for more clarity? Commented May 15, 2017 at 11:20

1 Answer

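The problem is that pd.concat with axis=1 aligns rows by index, not by station name, so stations that appear more than once or in a different order get the wrong counts. I use join instead, which does a left join on the code column: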

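for x in ["board", "off"]:
    # Temporary key column holding the station to look up in rec
    df["code"] = df["train_" + x + "_station"]
    # Left join on the station name instead of the row position
    df = df.join(rec.set_index('code'), on='code')
    df[x + "_count_A"] = df["count_A"]
    df[x + "_count_B"] = df["count_B"]
    # Drop the temporary columns before the next iteration
    df = df.drop(["code", "count_A", "count_B"], axis=1)

print(df)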
  ticket_number train_board_station train_off_station board_count_A  \
0           AAA               Tokyo           Phoenix           1.2   
1           AAA                  LA            London             3   
2           AAA               Paris            Sydney             4   
3           ABC            New_York            Berlin             8   
4           ABA               Delhi          Shanghai            10   
5           ADC             Phoenix                LA             2   
6           ABA              London             Paris           7.8   
7           BBB                  LA          New_York             3   

  board_count_B off_count_A off_count_B  
0            12           2          25  
1            36         7.8          78  
2             4           8          88  
3             8           5          51  
4            10         7.8          78  
5            25           3          36  
6            78           4           4  
7            36           8           8  
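
An alternative that avoids the temporary columns is to look each station up directly with Series.map. This is only a sketch under the assumption that every value in rec["code"] is unique (map against a Series with a duplicated index raises an error); the name `lookup` is introduced here for illustration and is not from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "ticket_number": ['AAA', 'AAA', 'AAA', 'ABC', 'ABA', 'ADC', 'ABA', 'BBB'],
    "train_board_station": ['Tokyo', 'LA', 'Paris', 'New_York', 'Delhi', 'Phoenix', 'London', 'LA'],
    "train_off_station": ['Phoenix', 'London', 'Sydney', 'Berlin', 'Shanghai', 'LA', 'Paris', 'New_York'],
})
rec = pd.DataFrame({
    "code": ['Tokyo', 'London', 'Paris', 'New_York', 'Shanghai', 'LA', 'Sydney', 'Berlin', 'Phoenix', 'Delhi'],
    "count_A": ['1.2', '7.8', '4', '8', '7.8', '3', '8', '5', '2', '10'],
    "count_B": ['12', '78', '4', '8', '78', '36', '88', '51', '25', '10'],
})

# Index rec by station code so each count column can act as a lookup table
# (assumption: codes are unique)
lookup = rec.set_index("code")

for x in ["board", "off"]:
    stations = df["train_" + x + "_station"]
    for col in ["count_A", "count_B"]:
        # map replaces each station name with the matching value from rec
        df[x + "_" + col] = stations.map(lookup[col])

print(df)
```

A station that is missing from rec would get NaN here, which matches what the left join does.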

3 Comments

Thank you for it, I will check it.
Also, the input has been changed slightly. I have removed station_A and station_B for simplicity.
Sorry, the previous solution was complicated. Now I think it is simpler and, I hope, correct. Please check it.
