
I am trying to manipulate a CSV using Pandas and I need to get the data into the format of one row per ID.

This is an example of what I am trying to accomplish:

From:

df = pd.DataFrame({
'ID': [1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 6, 7], 
'Color': ['Orange', 'Yellow', 'Red', 'Yellow', 'Green', 'Purple', 'Red', 'Orange', 'Orange', 'Red', 'Yellow', 'Purple', 'Red', 'Orange'], 
'Fruit': ['Orange', 'Banana', 'Apple', 'Banana', 'Pear', 'Grapes', 'Apple', 'Orange', 'Peach', 'Apple', 'Banana', 'Grapes', 'Apple', 'Peach']
})
ID Color Fruit
1 Orange Orange
1 Yellow Banana
1 Red Apple
2 Yellow Banana
2 Green Pear
3 Purple Grapes
4 Red Apple
4 Orange Orange
5 Orange Peach
5 Red Apple
5 Yellow Banana
5 Purple Grapes
6 Red Apple
7 Orange Peach

to

ID  Color                         Fruit
1   Orange, Yellow, Red           Orange, Banana, Apple
2   Yellow, Green                 Banana, Pear
3   Purple                        Grapes
4   Red, Orange                   Apple, Orange
5   Orange, Red, Yellow, Purple   Peach, Apple, Banana, Grapes
6   Red                           Apple
7   Orange                        Peach

Additionally, it is important that the values in every column are combined in the same row order (e.g., Red stays lined up with Apple).

I have found that using this line below, I can accomplish what I need for a single column but I am struggling to figure out how to do this for all columns (in the actual data there are about 16 columns).

df_combined = df.groupby(['ID'])['Color'].agg(','.join).reset_index()

Can anyone point me in the right direction? I think I could probably accomplish this in a loop but I want to make sure I am not making the code too complex for the large dataset it is for eventually.

  • Ultimately, you could repeat this code for the other columns and then concatenate/merge the results into a new DataFrame. Commented Oct 10 at 17:27

1 Answer

Run it without the ['Color'] selection:

new = df.groupby("ID").agg(",".join).reset_index()

and the aggregation is applied to every column other than the grouping key.


Minimal working example:

data = """ID    Color   Fruit
1   Orange  Orange
1   Yellow  Banana
1   Red     Apple
2   Yellow  Banana
2   Green   Pear
3   Purple  Grapes
4   Red     Apple
4   Orange  Orange
5   Orange  Peach
5   Red     Apple
5   Yellow  Banana
5   Purple  Grapes
6   Red     Apple
7   Orange  Peach"""

import pandas as pd
import io

df = pd.read_csv(io.StringIO(data), sep=r"\s+")
print(df)

new = df.groupby("ID").agg(",".join).reset_index()
print(new)

Result:

   ID                     Color                      Fruit
0   1         Orange,Yellow,Red        Orange,Banana,Apple
1   2              Yellow,Green                Banana,Pear
2   3                    Purple                     Grapes
3   4                Red,Orange               Apple,Orange
4   5  Orange,Red,Yellow,Purple  Peach,Apple,Banana,Grapes
5   6                       Red                      Apple
6   7                    Orange                      Peach
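
If you want the separator to match the output in the question (comma plus space), a minimal sketch along these lines should work. It assumes every column other than ID holds strings; the shortened sample data and the `pairs` check are only there for illustration.

```python
import pandas as pd

# Shortened sample data (illustrative, taken from the first rows of the question)
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "Color": ["Orange", "Yellow", "Red", "Yellow", "Green", "Purple"],
    "Fruit": ["Orange", "Banana", "Apple", "Banana", "Pear", "Grapes"],
})

# ", ".join gives "Orange, Yellow, Red" instead of "Orange,Yellow,Red"
new = df.groupby("ID").agg(", ".join).reset_index()
print(new)

# Rows keep their original order inside each group, so position i in Color
# still belongs to position i in Fruit
first = new.iloc[0]
pairs = list(zip(first["Color"].split(", "), first["Fruit"].split(", ")))
print(pairs)
```

Because groupby keeps the original row order inside each group, Red stays lined up with Apple without any extra sorting step.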

EDIT:

The error in your comment may suggest that you have a column with numbers, which need to be converted to strings before running ",".join, e.g. using map(str, column):

def convert(column):
    # Convert every value to str first, because str.join() accepts only strings.
    return ",".join(map(str, column))

new = df.groupby("ID").agg(convert).reset_index()
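
As an alternative sketch, you could cast the value columns to strings up front and keep the plain ",".join aggregation. The small DataFrame and the names `value_cols` and `as_text` are made up for illustration, and missing values are not handled specially here.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Color": ["Orange", "Yellow", "Green"],
    "Rank": [1, 2, 5],
})

# str.join() rejects non-string items, which is the likely source of the TypeError
try:
    ",".join([1, 2])
except TypeError as exc:
    message = str(exc)
    print(message)

# Cast every column except the grouping key to str, then join as before
value_cols = [col for col in df.columns if col != "ID"]
as_text = df.astype({col: str for col in value_cols})
new = as_text.groupby("ID").agg(",".join).reset_index()
print(new)
```

This leaves ID as integers, so later merges on ID should behave the same as with the original frame.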

Another idea is to keep everything as lists instead of converting to strings:

new = df.groupby("ID").agg(list).reset_index()
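
One possible advantage of lists is that the grouping can be undone later. The sketch below assumes pandas 1.3 or newer (multi-column `explode`) and uses a small made-up DataFrame; `restored` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Color": ["Orange", "Yellow", "Green"],
    "Fruit": ["Orange", "Banana", "Pear"],
})

# One row per ID, each cell holding the values in their original order
new = df.groupby("ID").agg(list).reset_index()
print(new)

# explode() on several columns at once expands the lists back into rows;
# it requires the lists in one row to have equal lengths, which holds here
restored = new.explode(["Color", "Fruit"], ignore_index=True)
print(restored)
```

The round trip returns the same Color/Fruit pairs row by row, which is a quick way to check that the columns stayed aligned.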

Finally, you can check the type of data in each column and

  • keep columns with integer/float values as lists of values,
  • convert all other columns to strings.

def convert(column):
    if column.dtype in (int, float):
        return list(column)
    else:
        return ",".join(map(str, column))

new = df.groupby("ID").agg(convert).reset_index()
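
The `column.dtype in (int, float)` check depends on how NumPy maps Python's int and float to concrete dtypes, so it may miss columns such as int32 or pandas' nullable integers. A possibly more robust variant, shown here only as a sketch on made-up data, uses `pandas.api.types.is_numeric_dtype`:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def convert(column):
    # Numeric columns stay as lists of numbers
    if is_numeric_dtype(column):
        return list(column)
    # Everything else is joined into one comma-separated string
    return ",".join(map(str, column))

df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Color": ["Orange", "Yellow", "Green"],
    "Rank": [1, 2, 5],
})

new = df.groupby("ID").agg(convert).reset_index()
print(new)
```

Note that `is_numeric_dtype` also treats boolean columns as numeric, so those would end up as lists too.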

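Minimal working code with a column Rank that holds integer values. Note that reset_index(drop=True) discards the ID index, which is why ID is absent from the aggregated results below; use reset_index() to keep it.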

data = """ID    Color   Fruit   Rank
1   Orange  Orange  1
1   Yellow  Banana  2
1   Red     Apple   3
2   Yellow  Banana  4
2   Green   Pear    5
3   Purple  Grapes  6
4   Red     Apple   7
4   Orange  Orange  8
5   Orange  Peach   9
5   Red     Apple   10
5   Yellow  Banana  11
5   Purple  Grapes  12
6   Red     Apple   13
7   Orange  Peach   14"""

import pandas as pd
import io

df = pd.read_csv(io.StringIO(data), sep=r"\s+")
print(df)

# new = df.groupby("ID").agg(",".join).reset_index(drop=True)
# print(new)

print("--- strings ---")

def convert(column):
    # Convert every value to str first, because str.join() accepts only strings.
    return ",".join(map(str, column))

new = df.groupby("ID").agg(convert).reset_index(drop=True)
print(new)

print("--- lists ---")

new = df.groupby("ID").agg(list).reset_index(drop=True)
print(new)

print("--- strings and lists ---")

def convert(column):
    if column.dtype in (int, float):
        return list(column)
    else:
        return ",".join(map(str, column))

new = df.groupby("ID").agg(convert).reset_index(drop=True)
print(new)

Result:

    ID   Color   Fruit  Rank
0    1  Orange  Orange     1
1    1  Yellow  Banana     2
2    1     Red   Apple     3
3    2  Yellow  Banana     4
4    2   Green    Pear     5
5    3  Purple  Grapes     6
6    4     Red   Apple     7
7    4  Orange  Orange     8
8    5  Orange   Peach     9
9    5     Red   Apple    10
10   5  Yellow  Banana    11
11   5  Purple  Grapes    12
12   6     Red   Apple    13
13   7  Orange   Peach    14

--- strings ---

                      Color                      Fruit        Rank
0         Orange,Yellow,Red        Orange,Banana,Apple       1,2,3
1              Yellow,Green                Banana,Pear         4,5
2                    Purple                     Grapes           6
3                Red,Orange               Apple,Orange         7,8
4  Orange,Red,Yellow,Purple  Peach,Apple,Banana,Grapes  9,10,11,12
5                       Red                      Apple          13
6                    Orange                      Peach          14

--- lists ---

                           Color                           Fruit             Rank
0          [Orange, Yellow, Red]         [Orange, Banana, Apple]        [1, 2, 3]
1                [Yellow, Green]                  [Banana, Pear]           [4, 5]
2                       [Purple]                        [Grapes]              [6]
3                  [Red, Orange]                 [Apple, Orange]           [7, 8]
4  [Orange, Red, Yellow, Purple]  [Peach, Apple, Banana, Grapes]  [9, 10, 11, 12]
5                          [Red]                         [Apple]             [13]
6                       [Orange]                         [Peach]             [14]

--- strings and lists ---

                      Color                      Fruit             Rank
0         Orange,Yellow,Red        Orange,Banana,Apple        [1, 2, 3]
1              Yellow,Green                Banana,Pear           [4, 5]
2                    Purple                     Grapes              [6]
3                Red,Orange               Apple,Orange           [7, 8]
4  Orange,Red,Yellow,Purple  Peach,Apple,Banana,Grapes  [9, 10, 11, 12]
5                       Red                      Apple             [13]
6                    Orange                      Peach             [14]

5 Comments

Thank you. When I try this with my example data, it works. But when I try to use it on my larger dataset, I get this error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[83], line 2
      1 LN_HCP_STATE_LICENSE
----> 2 LN_HCP_STATE_LICENSE.groupby('LNPID')['RANK'].agg(','.join).reset_index()
I'm not sure of the best way to show the error here, but it seems to be coming from the groupby call...
You could edit the question to add the full error. Or you could create a new question with example data and code that reproduce the problem, so everyone can test it and work out a solution. In the new question you may link to the current one to show that it is a continuation, but you should still include all the information so other people don't have to visit this question to see the details.
The error shows a TypeError, which may suggest that some column holds a different type of data (e.g., integers). Such values can't be passed directly to ",".join; they have to be converted to strings first, or collected into a list of numbers, and this may need more complex code in .agg() that checks the type of data and uses join() or list().
I added code which converts values to strings before ",".join
