
I'm trying to duplicate rows of a pandas DataFrame (v.0.23.4, python v.3.7.1) based on an int value in one of the columns. I'm applying code from this question to do that, but I'm running into the following data type casting error: TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'. Basically, I'm not understanding why this code is attempting to cast to int32.

Starting with this,

import pandas as pd

dummy_dict = {'c1': ['a','b','c'],
              'c2': [0,1,2],
              'c3': ['textA','textB','textC']}
dummy_df = pd.DataFrame(dummy_dict)
    c1  c2  c3
0   a   0   textA
1   b   1   textB
2   c   2   textC

I'm doing this

dummy_df_test = dummy_df.reindex(dummy_df.index.repeat(dummy_df['c2']))

I want this at the end. However, I'm getting the above error instead.

    c1  c2  c3
0   a   0   textA
1   b   1   textB
2   c   2   textC
3   c   2   textC
  • Try: dummy_df_test = dummy_df.reindex(dummy_df.index.repeat(dummy_df['c2'].astype("int32"))) Commented May 13, 2019 at 17:38
  • Note, you should replace the 0 with 1, otherwise that row will be removed by repeat. But for me this works in pandas 0.24.2: df.reindex(df.index.repeat(df.replace(0, 1).c2)) Commented May 13, 2019 at 17:45
  • Thanks, @PMende and erfan, those are both helpful answers. Commented May 14, 2019 at 18:10

3 Answers

2

Just a workaround:

pd.concat([dummy_df[dummy_df.c2.eq(0)],dummy_df.loc[dummy_df.index.repeat(dummy_df.c2)]])

Another fantastic suggestion, courtesy of @Wen:

dummy_df.reindex(dummy_df.index.repeat(dummy_df['c2'].clip(lower=1)))

  c1  c2
0  a   0
1  b   1
2  c   2
2  c   2
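
To make the role of clip explicit: Index.repeat drops any label whose repeat count is 0, so the c2 == 0 row disappears unless the counts are floored at 1. A minimal sketch, assuming the two-column frame used in this answer (the names counts and out are illustrative only, and the final reset_index is optional):

```python
import pandas as pd

dummy_df = pd.DataFrame({'c1': ['a', 'b', 'c'],
                         'c2': [0, 1, 2]})

# Floor the repeat counts at 1 so rows with c2 == 0 are kept once.
counts = dummy_df['c2'].clip(lower=1)

# Repeat each index label by its count, then select those labels.
out = dummy_df.loc[dummy_df.index.repeat(counts)]

# Optional: renumber the rows instead of keeping duplicate labels.
out = out.reset_index(drop=True)
print(out)
```

Without the clip, the same expression would return only the b row once and the c row twice.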

3 Comments

Hmm, how about dummy_df.reindex(dummy_df.index.repeat(dummy_df['c2'].clip(lower=1)))
@WeNYoBen that's fantastic; if you want, you can post this as an answer. :)
No worries, feel free to include it in yours :-)
0

I believe the answer as to why it's happening can be found here: https://github.com/numpy/numpy/issues/4384

Specifying the dtype as int32 should solve the problem, as suggested in the first comment on the question.
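
A sketch of that cast, under the assumption that the error comes from NumPy refusing to "safely" narrow int64 repeat counts to the platform's index integer type (np.intp), which is 32-bit on some builds. Casting to np.intp instead of hard-coding int32 is my own variation, not something taken from the linked issue; on a build where np.intp is int32 the two are equivalent.

```python
import numpy as np
import pandas as pd

dummy_df = pd.DataFrame({'c1': ['a', 'b', 'c'],
                         'c2': [0, 1, 2]})

# Assumption: the repeat counts only need to match the platform index type.
repeats = dummy_df['c2'].astype(np.intp)

dummy_df_test = dummy_df.reindex(dummy_df.index.repeat(repeats))
print(dummy_df_test)
```

Note that this only addresses the casting error; the row with c2 == 0 is still dropped, because it is repeated zero times.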


0

The concat function can do this. Here df is the DataFrame from the question. In the first attempt all rows are duplicated; in the second, only the row with index 2 is duplicated.

print(df)

df2 = pd.concat([df]*2, ignore_index=True)
print(df2)

df3 = pd.concat([df, df.iloc[[2]]])
print(df3)

  c1  c2     c3
0  a   0  textA
1  b   1  textB
2  c   2  textC
  c1  c2     c3
0  a   0  textA
1  b   1  textB
2  c   2  textC
3  a   0  textA
4  b   1  textB
5  c   2  textC
  c1  c2     c3
0  a   0  textA
1  b   1  textB
2  c   2  textC
2  c   2  textC

If you plan to reset the index at the end:

df3 = df3.reset_index(drop=True)
