Merging csv files with some columns same and others different in python

Question

I am new to coding and I'm having an issue merging csv files. I have searched similar questions and haven't found a fix. Just to include some relevant details: CSV files are cancer types over the period of 1950 - 2017 for different countries (lung cancer, colorectal cancer, stomach cancer, liver cancer and breast cancer) Below is an example of the layout of lung cancer.

 dlung.describe(include='all')   
 dlung


    Year    Cancer  Country     Gender  ASR     SE
0   1950    Lung    Australia   Male    13.89   0.56
1   1951    Lung    Australia   Male    14.84   0.57
2   1952    Lung    Australia   Male    17.19   0.61
3   1953    Lung    Australia   Male    18.21   0.62
4   1954    Lung    Australia   Male    19.05   0.63
5   1955    Lung    Australia   Male    20.65   0.65
6   1956    Lung    Australia   Male    22.05   0.67
7   1957    Lung    Australia   Male    23.93   0.69
8   1958    Lung    Australia   Male    23.77   0.68
9   1959    Lung    Australia   Male    26.12   0.71
10  1960    Lung    Australia   Male    27.08   0.72

I am interested in joining all cancer types into one dataframe based on shared columns (year, country). I have tried different methods, but they all seem to duplicate Year and Country (as below)

This one wasn't bad, but I have two columns for year and country

df_lung_colorectal = pd.concat([dlung, dcolorectal], axis = 1)

df_lung_colorectal 

Year    Cancer  Country Gender  ASR SE  Year    Cancer  Country Gender  ASR SE

If I continue like this, I will end up with 5 identical columns for YEAR and 5 for COUNTRY.

Any ideas on how merge all values that are independent (Cancer type and associated ASR (standardized risk), and SE values) with only one column for YEAR, COUNTRY (and GENDER) if possible?

jezrael · Accepted Answer · 2019-10-14 06:55:55Z

1

Yes, it is possible if use DataFrame.set_index, but then are duplicated another columns names:

print (dlung)
   Year Cancer    Country Gender    ASR    SE
0  1950   Lung  Australia   Male  13.89  0.56
1  1951   Lung  Australia   Male  14.84  0.57
2  1952   Lung  Australia   Male  17.19  0.61
3  1953   Lung  Australia   Male  18.21  0.62
4  1954   Lung  Australia   Male  19.05  0.63

print (dcolorectal)
    Year      Cancer    Country Gender    ASR    SE
6   1950  colorectal  Australia   Male  22.05  0.67
7   1951  colorectal  Australia   Male  23.93  0.69
8   1952  colorectal  Australia   Male  23.77  0.68
9   1953  colorectal  Australia   Male  26.12  0.71
10  1954  colorectal  Australia   Male  27.08  0.72

df_lung_colorectal = pd.concat([dlung.set_index(['Year','Country','Gender']), 
                                dcolorectal.set_index(['Year','Country','Gender'])], axis = 1)

print (df_lung_colorectal)
                      Cancer    ASR    SE      Cancer    ASR    SE
Year Country   Gender                                             
1950 Australia Male     Lung  13.89  0.56  colorectal  22.05  0.67
1951 Australia Male     Lung  14.84  0.57  colorectal  23.93  0.69
1952 Australia Male     Lung  17.19  0.61  colorectal  23.77  0.68
1953 Australia Male     Lung  18.21  0.62  colorectal  26.12  0.71
1954 Australia Male     Lung  19.05  0.63  colorectal  27.08  0.72

But I think better is first concat all DataFrame together with axis=0, what is default value, so should be removed and last reshape by DataFrame.set_index and DataFrame.unstack:

df = pd.concat([dlung, dcolorectal]).set_index(['Year','Country','Gender','Cancer']).unstack()
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
   Year    Country Gender  ASR_Lung  ASR_colorectal  SE_Lung  SE_colorectal
0  1950  Australia   Male     13.89           22.05     0.56           0.67
1  1951  Australia   Male     14.84           23.93     0.57           0.69
2  1952  Australia   Male     17.19           23.77     0.61           0.68
3  1953  Australia   Male     18.21           26.12     0.62           0.71
4  1954  Australia   Male     19.05           27.08     0.63           0.72

answered Oct 14, 2019 at 6:55

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Quant Christo Over a year ago

I think that answer by @jezrael is pretty comprehensive to your question, but you should also consider: is your format is consistent with 'tidy data' format?; just my 50 cents: ibm.com/developerworks/community/blogs/jfp/entry/…

Ryan Burke Over a year ago

Thank you for the response jezrael. It's great! Also, thanks Quant Christo. I agree it is not in the best format. I will reformat to keep all variables grouped as opposed to separate as I have above!

Ramesh Kamath · Accepted Answer · 2019-10-14 06:59:40Z

0

Concat with axis=0 to merge them row-wise.

with axis=1 you are asking it to Concat side-to-side.

answered Oct 14, 2019 at 6:59

Ramesh Kamath

3092 silver badges6 bronze badges

Collectives™ on Stack Overflow

Merging csv files with some columns same and others different in python

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related