3

I need to duplicate a row based on a column (a number).

I have this dataframe:

         Kids       City Preferred Transport  Count
0       Sofia     Boston                 Bus    1.0
1  Claire Ann        NaN                 NaN    1.0
2         Joe    Detroit               Train    3.0
3       Betty        NaN                 Car    2.0
4      Archie  Bruxelles                 NaN    1.0
5         Joe        NaN            Airplane    3.0
6        Phil     Berlin                Ship    1.0
7        Luke        NaN            Airplane    1.0

Each kid has a count number. It should appear in the dataframe as much as the number from the Count column shows. (Ex: it should only be 1 Sofia, 2 Bettys, and 3 Joes). But only have their names copied, like this:

         Kids       City Preferred Transport  Count
0       Sofia     Boston                 Bus    1.0
1  Claire Ann        NaN                 NaN    1.0
2         Joe    Detroit               Train    3.0
3       Betty        NaN                 Car    2.0
4      Archie  Bruxelles                 NaN    1.0
5         Joe        NaN            Airplane    3.0
6        Phil     Berlin                Ship    1.0
7        Luke        NaN            Airplane    1.0
8        Joe         NaN                NaN     3.0
9        Betty       NaN                NaN     2.0

I tried this code, using Pandas:

kid = nyc_trip_df.loc[nyc_trip_df['Kids'].count() > nyc_trip_df['Count'], 'Kids']

Because I thought that I would be a good idea to put all the kids that have their names appear less then their corresponding number, but somewhere I failed. And if it was successfull, then I would just append the kids to the dataframe, but I am not sure that this is the best approach though....

Can you help me please?

How can I make this work ?

1 Answer 1

2

Solution

s1 = df['Kids'].value_counts()
s2 = df.groupby('Kids')['Count'].first()   

df = df.append(s2.loc[s2.index.repeat(s2 - s1)].reset_index(), ignore_index=True)

Explanations

Calculate the count of unique values in Kids column using value_counts

>>> s1

Joe           2
Sofia         1
Betty         1
Claire Ann    1
Phil          1
Luke          1
Archie        1
Name: Kids, dtype: int64

Group the dataframe by Kids column and aggregate the Count column using first

>>> s2

Kids
Archie        1.0
Betty         2.0
Claire Ann    1.0
Joe           3.0
Luke          1.0
Phil          1.0
Sofia         1.0
Name: Count, dtype: float64

Subtract s2 from s1 to create a series which contains the number of times the name of each Kid should be repeated

>>> s2 - s1

Archie        0.0
Betty         1.0
Claire Ann    0.0
Joe           1.0
Luke          0.0
Phil          0.0
Sofia         0.0
dtype: float64

Now, repeat the index of s2 using the above calculated repetitions

>>> s2.loc[s2.index.repeat(s2 - s1)]

Kids
Betty    2.0
Joe      3.0
Name: Count, dtype: float64

Append the above repetitions to the original dataframe df

>>> df

         Kids       City Preferred Transport  Count
0       Sofia     Boston                 Bus    1.0
1  Claire Ann        NaN                 NaN    1.0
2         Joe    Detroit               Train    3.0
3       Betty        NaN                 Car    2.0
4      Archie  Bruxelles                 NaN    1.0
5         Joe        NaN            Airplane    3.0
6        Phil     Berlin                Ship    1.0
7        Luke        NaN            Airplane    1.0
8       Betty        NaN                 NaN    2.0
9         Joe        NaN                 NaN    3.0
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you very much! I learnt so many tricks from your answer!! It is really useful!
@DaphneF. Glad i could help :) by the way it is a very interesting question.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.