Fix duplicate index names in dataframe

Question

I am looking for the simplest way to create a pandas DataFrame from a CSV file that has duplicate index names (s1 and s2 in the example below).

Here is what the CSV file looks like:

       var1   var2    var3
unit x    8      4      12
temp y   -1     -4      -3
time     
s1        9     12      11
s2       12     15       7
month    
s1        1      3      12 
s2        2      4       6

The resulting DataFrame should look like this:

        var1   var2    var3
unit x     8      4      12
temp y    -1     -4      -3
time s1    9     12      11
time s2   12     15       7
month s1   1      3      12
month s2   2      4       6

What's the best way to perform this operation?

jezrael · Accepted Answer · 2018-09-10 05:59:10Z

3

Use:

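import numpy as np

# convert the index to a Series
s = df.index.to_series()
# flag every label that occurs more than once (here: s1 and s2)
m = s.duplicated(keep=False)
# blank out the duplicated labels, forward fill the group label
# above them (time, month), then prepend it to the original label
df.index = np.where(m, s.mask(m).ffill() + ' ' + s.index, s)
# drop the group rows, which contain only NaNs
df = df.dropna(how='all')
print(df)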
          var1  var2  var3
unit x     8.0   4.0  12.0
temp y    -1.0  -4.0  -3.0
time s1    9.0  12.0  11.0
time s2   12.0  15.0   7.0
month s1   1.0   3.0  12.0
month s2   2.0   4.0   6.0
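
For reference, here is a minimal self-contained sketch of the same idea. It assumes the frame produced by reading the CSV has empty (all-NaN) rows for time and month; the sample frame below is a hypothetical reconstruction of the question's data, and the names labels, dupes and groups are only illustrative. The labels are moved onto a plain positional index first, so the string concatenation is a Series + Series operation, as suggested in the comments on this answer.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the frame after reading the CSV:
# the group rows "time" and "month" carry no values (all NaN).
df = pd.DataFrame(
    {
        "var1": [8, -1, np.nan, 9, 12, np.nan, 1, 2],
        "var2": [4, -4, np.nan, 12, 15, np.nan, 3, 4],
        "var3": [12, -3, np.nan, 11, 7, np.nan, 12, 6],
    },
    index=["unit x", "temp y", "time", "s1", "s2", "month", "s1", "s2"],
)

# Index labels as a Series with a plain 0..n-1 index
labels = df.index.to_series().reset_index(drop=True)
# True for every label that occurs more than once (s1, s2)
dupes = labels.duplicated(keep=False)
# Blank out duplicated labels, then forward fill the group label above them
groups = labels.mask(dupes).ffill()
# Prefix duplicated labels with their group; keep unique labels unchanged
df.index = np.where(dupes, groups + " " + labels, labels)
# Drop the group rows, which contain only NaNs
df = df.dropna(how="all")
print(df)
```

This relies on the sub-labels (s1, s2) repeating across groups while the group labels themselves are unique; if a sub-label appeared only once in the file, it would not be flagged as a duplicate and would keep its bare name.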

edited Sep 10, 2018 at 5:59

answered Sep 10, 2018 at 5:36

jezrael


2 Comments

arqchicago Over a year ago

This line gives the warning below:
df.index = np.where(m, s.mask(m).ffill() + ' ' + s.index, s)
FutureWarning: using '+' to provide set union with Indexes is deprecated, use '|' or .union()
Is there a way to rewrite it using union() or '|'?

jezrael Over a year ago

@arqchicago Does df.index = np.where(m, s.mask(m).ffill() + ' ' + s, s) work?

Naga kiran · Accepted Answer · 2018-09-10 05:50:37Z

0

Consider this DataFrame with a two-level index (A, B):

       C    D    E
A B
a 4  7.0  1.0  5.0
  5  3.0  4.0  5.5
b 5  8.0  3.0  3.0
c 4  9.0  5.0  6.0
f 4  3.0  0.0  4.0

You can use df.reset_index() with the default drop=False, which turns each index level into a regular column. Then join those columns into a single string, assign the result as the new index, and drop the columns.

# convert the index levels to columns (df1 is the original frame)
df = df1.reset_index()
# join the two former index columns into a single string index
df.index = df[df.columns[0]].astype(str) + ' ' + df[df.columns[1]].astype(str)
# drop the two former index columns
df = df.drop(df.columns[[0, 1]], axis=1)

Out:

    C   D   E
a 4 7.0 1.0 5.0
a 5 3.0 4.0 5.5
b 5 8.0 3.0 3.0
c 4 9.0 5.0 6.0
f 4 3.0 0.0 4.0
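
As an alternative sketch of the same flattening step, the index tuples can be joined directly instead of going through reset_index. This is a different technique from the one in the answer above; the frame df1 is rebuilt here from the example data, and flat is just an illustrative name.

```python
import pandas as pd

# Example frame with a two-level index (A, B), rebuilt from the data above
df1 = pd.DataFrame(
    {
        "C": [7.0, 3.0, 8.0, 9.0, 3.0],
        "D": [1.0, 4.0, 3.0, 5.0, 0.0],
        "E": [5.0, 5.5, 3.0, 6.0, 4.0],
    },
    index=pd.MultiIndex.from_tuples(
        [("a", 4), ("a", 5), ("b", 5), ("c", 4), ("f", 4)],
        names=["A", "B"],
    ),
)

# Iterating over a MultiIndex yields one tuple per row; join each
# tuple's parts into a single space-separated string label.
flat = df1.copy()
flat.index = [" ".join(map(str, key)) for key in df1.index]
print(flat)
```

Working on a copy keeps df1 and its MultiIndex intact in case the separate levels are still needed later.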

answered Sep 10, 2018 at 5:50

Naga kiran
