
I need to merge similar columns and remove duplicates (entries with the same date). The data frame:

     Albumin  C-reactive protein    CRP  Ferritin  Haemoglobin     Hb  Iron  Nancy Index  Plasma  Platelets  Transferrin saturation %  Transferrin saturations  UCEIS (0 to 8)    WCC  White Cell Count           test_date
0        NaN                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             NaN    NaN             12.35 2016-04-17 23:00:00              
1        NaN                 NaN    NaN       NaN        133.0    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
2        NaN                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN      406.0                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
3        NaN                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN      406.0                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
4        NaN                32.2    NaN       NaN          NaN    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00             
5       36.0                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
6        NaN                 NaN    NaN      99.7          NaN    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
7        NaN                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN        NaN                      25.0                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
12      36.0                 NaN   32.2      99.7          NaN  133.0   NaN          NaN     NaN      406.0                       NaN                     25.0             NaN  12.35               NaN 2016-04-17 23:00:00              
14       NaN                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             7.0    NaN               NaN 2016-04-25 23:00:00              
79      34.0                 NaN    5.4      55.9          NaN  133.0   NaN          NaN     NaN      372.0                       NaN                     28.0             NaN   7.99               NaN 2016-06-12 23:00:00              

I need to get:

     Albumin    CRP  Ferritin     Hb  Nancy Index  Plasma  Platelets  Transferrin saturations  UCEIS (0 to 8)    WCC           test_date
12      36.0   32.2      99.7  133.0          NaN     NaN      406.0                     25.0             NaN  12.35 2016-04-17 23:00:00
14       NaN    NaN       NaN    NaN          NaN     NaN        NaN                      NaN             7.0    NaN 2016-04-25 23:00:00
79      34.0    5.4      55.9  133.0          NaN     NaN      372.0                     28.0             NaN   7.99 2016-06-12 23:00:00

So, column 'C-reactive protein' should be merged with 'CRP', 'Haemoglobin' with 'Hb', and 'Transferrin saturation %' with 'Transferrin saturations'.

I can easily remove duplicates with .drop_duplicates(), but the trick is to remove not only rows with the same date, but also to make sure that the values really are duplicated. For example, 'C-reactive protein' at row 4 has the same value as 'CRP' in row 12, and they both have the same entry date. Given all that, I need to end up with only a 'CRP' column holding the value 32.2 for the date 2016-04-17 (plus the other unique columns).
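For reference, this is all I have tried so far (a rough sketch against the frame above; neither call gives the result I need):

# removes only rows that are identical in every column (here just the repeated
# Platelets row, index 3); the readings for 2016-04-17 stay spread over many rows
df.drop_duplicates()

# keeps a single row per date, but discards every reading that happens to sit
# in one of the other rows for that date (Albumin, CRP, Ferritin, ...)
df.drop_duplicates(subset=['test_date'])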

EDIT

Some entries really are duplicates (absolutely identical, caused by system glitches), for example the last three rows, on 2016-06-20 (indices '803' and '122'). Is the solution below capable of removing such identical rows? (The little I have tried for this myself is sketched after the sample below.)

P.S. Thanks for the amazing and general solution for duplicate, but not identical, entries.

     Albumin  C-reactive protein  CRP  Ferritin  Haemoglobin     Hb  Iron  Nancy Index  Plasma  Platelets  Transferrin saturation %  Transferrin saturations  UCEIS (0 to 8)   WCC  White Cell Count          setName           test_date
735     39.0                 NaN  0.4      52.0          NaN  144.0   NaN          NaN     NaN      197.0                       NaN                     25.0             NaN  4.88               NaN           Bloods 2016-05-31 23:00:00
803     40.0                 NaN  0.2      81.0          NaN  147.0   NaN          NaN     NaN      234.0                       NaN                     35.0             NaN  8.47               NaN           Bloods 2016-06-20 23:00:00
347      NaN                 NaN  NaN       NaN          NaN    NaN   NaN          NaN     1.0        NaN                       NaN                      NaN             NaN   NaN               NaN  Research Bloods 2016-06-20 23:00:00
122     40.0                 NaN  0.2      81.9          NaN  147.0   NaN          NaN     NaN      234.0                       NaN                     35.0             NaN  8.47               NaN           Bloods 2016-06-20 23:00:00
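
The sketch mentioned above: the only thing I have tried for these glitch entries is dropping exact duplicates up front, which assumes the duplicated rows really are identical in every column, including setName and test_date:

# removes rows that are exact copies of an earlier row; near-duplicates
# (e.g. Ferritin 81.0 vs 81.9 above) are deliberately left alone
df = df.drop_duplicates()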
  • Are you importing this from a csv or joining it from two sources? The way the data is structured, it looks like you have two (or more) sources. Commented Mar 29, 2017 at 13:43
  • @Stael, importing from .json. There was an update in the data collection platform, which is why we have different names for the same quantities. There are also some mistakes made by participants (which is why we have multiple entries on the same date). Commented Mar 29, 2017 at 13:44

2 Answers


I think you need groupby together with renaming the columns by a dict:

d = {'C-reactive protein':'CRP', 'Haemoglobin':'Hb', 
     'Transferrin saturation %':'Transferrin saturations'}
# one row per date (max skips NaN), then merge same-named columns row-wise
df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
print (df)
                     Albumin   CRP  Ferritin     Hb  Iron  Nancy Index  \
test_date                                                                
2016-04-17 23:00:00     36.0  32.2      99.7  133.0   NaN          NaN   
2016-04-25 23:00:00      NaN   NaN       NaN    NaN   NaN          NaN   
2016-06-12 23:00:00     34.0   5.4      55.9  133.0   NaN          NaN   

                     Plasma  Platelets  Transferrin saturations  \
test_date                                                         
2016-04-17 23:00:00     NaN      406.0                     25.0   
2016-04-25 23:00:00     NaN        NaN                      NaN   
2016-06-12 23:00:00     NaN      372.0                     28.0   

                     UCEIS (0 to 8)    WCC  White Cell Count  
test_date                                                     
2016-04-17 23:00:00             NaN  12.35             12.35  
2016-04-25 23:00:00             7.0    NaN               NaN  
2016-06-12 23:00:00             NaN   7.99               NaN  
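
Note that newer pandas versions warn that the axis=1 argument of groupby is deprecated. Assuming such a version, one equivalent way to get the same row-wise merge is to transpose, group on the index, and transpose back (a sketch, not part of the original answer):

df = df.groupby('test_date').max().rename(columns=d)
df = df.T.groupby(level=0).max().T   # same result as .groupby(axis=1, level=0).max()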

A more general solution is to reshape with melt, remove duplicates and then build the DataFrame back. (The output below appears to have been produced with 1000.0 test values substituted for some of the readings, to show that conflicting values on the same date are kept as separate rows rather than collapsed.)

d = {'C-reactive protein':'CRP', 'Haemoglobin':'Hb',
     'Transferrin saturation %':'Transferrin saturations'}
df = df.rename(columns=d).groupby(axis=1, level=0).max()

# long format: one row per (date, variable, value); exact duplicates collapse here
df = pd.melt(df, id_vars='test_date').dropna(subset=['value']).drop_duplicates()

df = (df.groupby(['test_date','variable'])['value']
        .apply(lambda x: pd.Series(x.values))   # number repeated readings 0, 1, ...
        .unstack(1)                             # variables back to columns
        .reset_index(level=1, drop=True)        # drop the per-reading counter
        .reset_index()                          # test_date back to a column
        .rename_axis(None, axis=1))
print (df)

             test_date  Albumin   CRP  Ferritin      Hb  Platelets  \
0  2016-04-17 23:00:00   1000.0  32.2      99.7  1000.0      406.0   
1  2016-04-17 23:00:00     36.0   NaN       NaN   133.0        NaN   
2  2016-04-25 23:00:00      NaN   NaN       NaN     NaN        NaN   
3  2016-06-12 23:00:00     34.0   5.4      55.9   133.0      372.0   

   Transferrin saturations  UCEIS (0 to 8)    WCC  White Cell Count  
0                     25.0             NaN  12.35             12.35  
1                      NaN             NaN    NaN               NaN  
2                      NaN             7.0    NaN               NaN  
3                     28.0             NaN   7.99               NaN  
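
Regarding the absolutely identical entries from the EDIT: after the melt every reading is reduced to a (test_date, variable, value) row, so rows that are exact copies of each other collapse at the drop_duplicates step. A small sketch with a made-up two-row glitch (not the real data):

import pandas as pd

glitch = pd.DataFrame({'CRP': [0.2, 0.2],
                       'Hb': [147.0, 147.0],
                       'test_date': pd.to_datetime(['2016-06-20 23:00'] * 2)})

long = pd.melt(glitch, id_vars='test_date').dropna(subset=['value']).drop_duplicates()
print (long)   # one row per (test_date, variable): CRP 0.2 and Hb 147.0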

8 Comments

Is it always the case that each date has only one distinct value per column and group, so that max can be used?
If you do have multiple values that are being masked by the max, would it work to: split the dataframe in two (d.keys() and d.values() as columns), rename, concat and drop duplicates?
I am not sure, but do you mean that if e.g. the first value 36.0 in the first column were 100, you would need 2 rows for the first datetime, one with 100 and one with 36.0?
@jezrael: I may have multiple entries on the same date and .max() may not be appropriate. Let me check what kind of entries I have in my data set and I will return to you guys. Thanks a lot for the suggested solution!
@Stael, please see my response to jezrael regarding multiple entries on the same date.

What @jezrael was saying is that if you had a situation where:

     Albumin  C-reactive protein    CRP  test_date
0        NaN                 NaN    32       2016-04-17 23:00:00              
1        NaN                 8.0    NaN      2016-04-17 23:00:00

then his method would erase the 8.0 reading and keep only the 32, because he does it in two steps (or three?) in this line:

df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()

The first step is

df = df.groupby('test_date').max() # selects max of each column 
                                   # while collapsing 'test_date'

which for my truncated example would give:

         Albumin  C-reactive protein    CRP  test_date
0        NaN                 8.0    32       2016-04-17 23:00:00

then rename .rename(columns=d) giving:

         Albumin  CRP    CRP  test_date
0        NaN      8.0    32   2016-04-17 23:00:00

then .groupby(axis=1, level=0).max() to group along rows (instead of down columns) which gives:

         Albumin  CRP  test_date
0        NaN      32   2016-04-17 23:00:00

which is where you run the highest risk of losing data.
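
A quick way to reproduce the masking on a toy frame (a sketch; toy is a hypothetical two-row version of the truncated example above, not the asker's data):

import pandas as pd

toy = pd.DataFrame({'Albumin': [float('nan'), float('nan')],
                    'C-reactive protein': [float('nan'), 8.0],
                    'CRP': [32.0, float('nan')],
                    'test_date': pd.to_datetime(['2016-04-17 23:00'] * 2)})

d = {'C-reactive protein': 'CRP'}
out = toy.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
print (out)   # CRP is 32.0 for 2016-04-17; the 8.0 reading is silently gone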

Alternative

I would split the original data into two frames first

df1 = df[["C-reactive protein","Haemoglobin", ...]]
df2 = df[["CRP", "Hb"]]

# then rename

df2 = df2.rename(columns={"CRP":"C-reactive protein", "Hb":"Haemoglobin", ...})

# use concat to stack them on one another

df3 = pd.concat([df1, df2]) # i've run out of names

df3 = df3.drop_duplicates() # perhaps also drop NAs?

but this is only necessary if you have multiple non-duplicate entries for the same test on the same day.
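
Run on the toy frame from the sketch further up (again a hypothetical frame, keeping only the CRP pair for brevity), the idea looks like this; both conflicting readings survive instead of being collapsed by max:

import pandas as pd

toy = pd.DataFrame({'C-reactive protein': [float('nan'), 8.0],
                    'CRP': [32.0, float('nan')],
                    'test_date': pd.to_datetime(['2016-04-17 23:00'] * 2)})

df1 = toy[['C-reactive protein', 'test_date']]
df2 = toy[['CRP', 'test_date']].rename(columns={'CRP': 'C-reactive protein'})

df3 = pd.concat([df1, df2]).drop_duplicates().dropna(subset=['C-reactive protein'])
print (df3)   # two rows remain: 8.0 and 32.0, both on 2016-04-17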

