Split Set into multiple columns Pandas Python

Question

I have a dataframe

        IDs            Types
0      1001            {251}
1      1013       {251, 101}
2      1004       {251, 701}
3      3011           {251}
4      1014            {701}
5      1114            {251}
6      1015            {251}

where each row of df['Types'] holds a set. I want to split this column into multiple columns so that I get the following output:

        IDs    Type1   Type2  
0      1001     251      -
1      1013     251     101
2      1004     251     701
3      3011     251      -
4      1014     701      -     
5      1114     251      -
6      1015     251      -

Currently, I am using the following code to achieve this

pd.concat([df['Types'].apply(pd.Series), df['IDs']], axis = 1)

But it returns the following error:

  Traceback (most recent call last):
  File "C:/Users/PycharmProjects/test/test.py", line 48, in <module>
    df = pd.concat([df['Types'].apply(pd.Series), df['IDs']], axis = 1)
  File "C:\Python\Python35\lib\site-packages\pandas\core\series.py", line 2294, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas\src\inference.pyx", line 1207, in pandas.lib.map_infer (pandas\lib.c:66124)
  File "C:\Python\Python35\lib\site-packages\pandas\core\series.py", line 223, in __init__
    "".format(data.__class__.__name__))
TypeError: 'set' type is unordered

Please advise how I can get the desired output. Thanks.

jezrael · Accepted Answer · 2017-04-17 15:03:57Z

2

I think you need the DataFrame constructor first, then rename the columns, and finally call fillna.

But filling with a string such as '-' can be a problem: the column then mixes numeric values with strings, and some pandas functions may break on it.

df1 = pd.DataFrame(df['Types'].values.tolist()) \
        .rename(columns = lambda x: 'Type{}'.format(x+1)) \
        .fillna('-')
print (df1)
   Type1 Type2
0    251     -
1    251   101
2    251   701

df2 = pd.concat([df['IDs'], df1], axis = 1)
print (df2)
    IDs  Type1 Type2
0  1001    251     -
1  1013    251   101
2  1004    251   701

Another, slower solution:

df1 = df['Types'].apply(lambda x: pd.Series(list(x))) \
                 .rename(columns =lambda x: 'Type{}'.format(x+1)) \
                 .fillna('-')

df2 = pd.concat([df['IDs'], df1], axis = 1)
print (df2)
    IDs  Type1 Type2
0  1001  251.0     -
1  1013  251.0   101
2  1004  251.0   701
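The mixed-type caveat above can be avoided by leaving the gaps as missing values instead of filling them with '-'. The following is only a sketch under assumptions that are not in the original answer: it sorts each set first (sets have no order, so without sorting, which value lands in Type1 depends on set iteration order), and it uses the nullable Int64 dtype so the columns stay integer. Note that sorting puts 101 before 251 in the second row, which differs from the output requested in the question.

```python
import pandas as pd

# Small sample in the same shape as the question's dataframe
df = pd.DataFrame({
    "IDs": [1001, 1013, 1004],
    "Types": [{251}, {251, 101}, {251, 701}],
})

# Sort each set so the result does not depend on set iteration order
rows = [sorted(s) for s in df["Types"]]

types = (
    pd.DataFrame(rows, index=df.index)
      .rename(columns=lambda i: "Type{}".format(i + 1))
      .astype("Int64")  # nullable integers: gaps stay <NA>, values stay ints
)

result = pd.concat([df["IDs"], types], axis=1)
print(result)
```

If a '-' is still wanted for display, it can be applied at the very end, for example just before exporting, so that intermediate computations keep numeric columns.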

edited Apr 17, 2017 at 15:03

answered Apr 17, 2017 at 14:43

jezrael


2 Comments

muazfaiz Over a year ago

Thanks. I was wondering why I need to convert the set into a list?

jezrael Over a year ago

I am not sure, but this solution is faster than .apply(Series); .apply(lambda x: pd.Series(list(x))) also works.
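On the question of why the conversion is needed: the traceback in the question shows the Series constructor rejecting a set because it is unordered, and a Series needs a defined element order. A minimal sketch of that behaviour follows; the use of sorted() rather than list() is an assumption added here to make the order reproducible.

```python
import pandas as pd

types = {251, 101}

# Passing a set directly is rejected, as in the question's traceback
try:
    pd.Series(types)
    raised = False
except TypeError as exc:
    raised = True
    print("pd.Series(set) failed:", exc)

# An ordered container is accepted; sorted() returns a list
s = pd.Series(sorted(types))
print(s.tolist())
```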

gold_cy · Accepted Answer · 2017-04-24 14:05:31Z

2

This should work:

temp = pd.DataFrame(df.Types.values.tolist()) \
         .add_prefix('Types_') \
         .fillna('-') \
         .rename(columns={'Types_0':'Type1','Types_1':'Type2'})

df = pd.concat([df.drop('Types',axis=1), temp], axis=1)

    IDs  Types_0  Types_1
0  1001      251      NaN
1  1013      251    101.0
2  1004      251    701.0

Edit: I missed the ('-') for missing values, should be good now.

Edit2: Column names as @jezrael pointed out

edited Apr 24, 2017 at 14:05

answered Apr 17, 2017 at 14:47

gold_cy


3 Comments

jezrael Over a year ago

I think your output is a bit different from what the OP wants; please check it.

jezrael Over a year ago

I mean the column names Types_0 and Types_1.

gold_cy Over a year ago

You are correct. I would simply rename the columns; I'll change mine, but your answer already provides this :thumbs up:

zipa · Accepted Answer · 2017-04-17 14:50:50Z

0

Another approach:

df['Type1'] = df['Types'].apply(lambda x: list(x)[0])
df['Type2'] = df['Types'].apply(lambda x: list(x)[1] if len(x) > 1 else '-')
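The same idea can be extended to sets of any size by looping up to the largest set length. This is a sketch only, with two assumptions not in the answer: each set is sorted so that the positions are reproducible, and '-' is kept as the filler as in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "IDs": [1001, 1013, 1014],
    "Types": [{251}, {251, 101}, {701}],
})

# Number of TypeN columns needed = size of the largest set
width = int(df["Types"].map(len).max())

for i in range(width):
    # i=i binds the current position into the lambda
    df["Type{}".format(i + 1)] = df["Types"].apply(
        lambda s, i=i: sorted(s)[i] if len(s) > i else "-"
    )

print(df.drop("Types", axis=1))
```

Because each set is sorted again for every column, this is likely slower than building the frame once from a list of lists; it is mainly useful when readability matters more than speed.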

answered Apr 17, 2017 at 14:50

zipa


Community · Accepted Answer · 2017-05-23 12:34:12Z

0

One liner (very similar to @DmitryPolonskiy's solution):

In [96]: df.join(pd.DataFrame(df.pop('Types').values.tolist(), index=df.index)
                   .add_prefix('Type_')) \
           .fillna('-')
Out[96]:
    IDs  Type_0 Type_1
0  1001     251      -
1  1013     251    101
2  1004     251    701
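One thing to be aware of with the one-liner above is that df.pop('Types') removes the column from df itself. If the original frame should stay intact, the same steps can be run on a copy. The sketch below makes that assumption explicit; the name work is hypothetical, and sorting each set is an addition made here for a reproducible column order.

```python
import pandas as pd

df = pd.DataFrame({
    "IDs": [1001, 1013, 1004],
    "Types": [{251}, {251, 101}, {251, 701}],
})

# Work on a copy so that pop() does not alter df
work = df.copy()

types = pd.DataFrame(
    [sorted(s) for s in work.pop("Types")],
    index=work.index,
).add_prefix("Type_")

out = work.join(types)
print(out)
```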

edited May 23, 2017 at 12:34

CommunityBot


answered Apr 17, 2017 at 15:00

MaxU - stand with Ukraine

