6

I need to reindex the 2nd level of a pandas dataframe, so that the 2nd level becomes a (complete) list 0,...,(N-1) for each 1st level index.

  • I tried using Allan/Hayden's approach, but unfortunately it only creates an index with as many rows as previously existing.
  • What I want is that for each new index, new rows are inserted (with nan values).

Example:

df = pd.DataFrame({
  'first': ['one', 'one', 'one', 'two', 'two', 'three'], 
  'second': [0, 1, 2, 0, 1, 1],
  'value': [1, 2, 3, 4, 5, 6]
})
print df

   first  second  value
0    one       0      1
1    one       1      2
2    one       2      3
3    two       0      4
4    two       1      5
5  three       1      6

# Tried using Allan/Hayden's approach, but no good for this, doesn't add the missing rows    
df['second'] = df.reset_index().groupby(['first']).cumcount()
print df
   first  second  value
0    one       0      1
1    one       1      2
2    one       2      3
3    two       0      4
4    two       1      5
5  three       0      6

My desired result is:

   first  second  value
0    one       0      1
1    one       1      2
2    one       2      3
3    two       0      4
4    two       1      5
4    two       2      nan <-- INSERTED
5  three       0      6
5  three       1      nan <-- INSERTED
5  three       2      nan <-- INSERTED
5
  • Could you just first create the data frame with all of the rows you need. Then update it with the values you have. Commented Aug 9, 2015 at 8:22
  • are the indices in "second" always contiguous and starting from 0? Commented Aug 9, 2015 at 8:22
  • Missing words from title: you want to Reindex 2nd level in incomplete multi-level dataframe to be complete, insert NANs on missing rows Commented Jul 16, 2022 at 19:59
  • Also, saying np.arange(N) is pretty obscure to non-numpy users, clearer to just say 0,...,(N-1) Commented Jul 16, 2022 at 20:01
  • In general, don't use groupby() as a poor-man's multiindex, do .set_index(['first', 'second']) wherever possible. Commented Jul 16, 2022 at 20:05

1 Answer 1

5

I think you can first set columns first and second as multi-level index, and then reindex.

# your data
# ==========================
df = pd.DataFrame({
  'first': ['one', 'one', 'one', 'two', 'two', 'three'], 
  'second': [0, 1, 2, 0, 1, 1],
  'value': [1, 2, 3, 4, 5, 6]
})

df

   first  second  value
0    one       0      1
1    one       1      2
2    one       2      3
3    two       0      4
4    two       1      5
5  three       1      6

# processing
# ============================
multi_index = pd.MultiIndex.from_product([df['first'].unique(), np.arange(3)], names=['first', 'second'])

df.set_index(['first', 'second']).reindex(multi_index).reset_index()

   first  second  value
0    one       0      1
1    one       1      2
2    one       2      3
3    two       0      4
4    two       1      5
5    two       2    NaN
6  three       0    NaN
7  three       1      6
8  three       2    NaN
Sign up to request clarification or add additional context in comments.

1 Comment

why is the .reset_index required after .reindex?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.