How to build an index from multiple columns and set it as a column in a pandas data frame?

Question

I'd like to learn how to add a data frame column holding a code mapped from multiple columns.

In the partial example below I was trying what could be a clumsy way, following this path: get the unique values as a temporary data frame; concatenate a prefix string with the temp row number as a new column; and then join the two data frames.

import pandas as pd

df = pd.DataFrame({'col1' : ['A1', 'A2', 'A1', 'A3'],
                   'col2' : ['B1', 'B2', 'B1', 'B1'],
                   'value' : [100, 200, 300, 400],
                   })

tmp = df[['col1','col2']].drop_duplicates(['col1', 'col2'])


#   col1 col2
# 0   A1   B1
# 1   A2   B2
# 3   A3   B1

The first question is: how do I get the 'temp' row number and put its value into a tmp column?

And what is the clever pythonic way to achieve the result below from df?

dfnew = pd.DataFrame({'code' : ['CODE0', 'CODE1', 'CODE0', 'CODE3'],
                      'col1' : ['A1', 'A2', 'A1', 'A3'],
                      'col2' : ['B1', 'B2', 'B1', 'B1'],
                      'value' : [100, 200, 300, 400],
                      })

    code col1 col2  value
0  CODE0   A1   B1    100
1  CODE1   A2   B2    200
2  CODE0   A1   B1    300
3  CODE3   A3   B1    400

thanks.

After the answers and just as an exercise I kept working on the non-pythonic version I had in mind with insights I got from great answers, and reached this:

tmp = df[['col1','col2']].drop_duplicates(['col1', 'col2'])

tmp.reset_index(inplace=True)

tmp.drop('index', axis=1, inplace=True)

tmp['code'] = tmp.index.to_series().apply(lambda x: 'code' + format(x, '04d'))

dfnew = pd.merge(df, tmp, on=['col1', 'col2'])

At the time of posting this question, I did not realize that it would be nicer to reset the index to get a fresh sequence instead of the original index numbers.

I tried some variations but could not work out how to chain 'reset_index' and 'drop' into a single command.

I'm starting to enjoy Python. Thank you all.

piRSquared · Accepted Answer · 2017-01-07 11:39:22Z

2

Group df.index by ['col1', 'col2'], then use transform('first') and map:

df.assign(
    code=df.index.to_series().groupby(
        [df.col1, df.col2]
    ).transform('first').map('CODE{}'.format)
)[['code'] + df.columns.tolist()]

    code col1 col2  value
0  CODE0   A1   B1    100
1  CODE1   A2   B2    200
2  CODE0   A1   B1    300
3  CODE3   A3   B1    400

explanation

# turn index to series so I can perform a groupby on it
idx_series = df.index.to_series()

# groupby col1 and col2 to establish uniqueness
idx_gb = idx_series.groupby([df.col1, df.col2])

# get first index value in each unique group
# and broadcast over entire group with transform
idx_tf = idx_gb.transform('first')

# map a format function to get desired string
code = idx_tf.map('CODE{}'.format)

# use assign to create new column
df.assign(code=code)

edited Jan 7, 2017 at 11:39

answered Jan 7, 2017 at 11:27

piRSquared


1 Comment

jcarlos Over a year ago

Hi @piRSquared, thank you, very fast solution indeed. One more request: how could I make 'groupby([df.col1, df.col2])' work on a variable list of columns/fields, something like 'df.groupby([list of fields])'? I'd like to generalize it into a function that takes the field names used to build the code as parameters.
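One possible way to generalize this, as a sketch: pass the column names as a list and build the grouping keys from it. The function name `add_group_code` and its `fields` and `prefix` parameters are hypothetical, introduced here only for illustration.

```python
import pandas as pd

def add_group_code(df, fields, prefix='CODE'):
    """Return a copy of df with a 'code' column built from the first
    index value of each unique combination of the given fields.
    (Hypothetical helper, not part of pandas.)"""
    # One grouping key per requested column
    keys = [df[f] for f in fields]
    # First index value of each group, broadcast to every row of the group
    first_idx = df.index.to_series().groupby(keys).transform('first')
    return df.assign(code=first_idx.map(lambda i: f'{prefix}{i}'))

df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3'],
                   'col2': ['B1', 'B2', 'B1', 'B1'],
                   'value': [100, 200, 300, 400]})

out = add_group_code(df, ['col1', 'col2'])
print(out)
```

Calling it with a different list, for example `add_group_code(df, ['col2'])`, groups on that column alone.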

jezrael · Accepted Answer · 2017-01-07 09:50:47Z

2

You can first sort the values by columns col1 and col2 with sort_values, then find all duplicates with duplicated:

import numpy as np

df = df.sort_values(['col1', 'col2'])
mask = df.duplicated(['col1','col2'])
print(mask)
0    False
2     True
1    False
3    False
dtype: bool

Then use insert, if you need to specify the position of the output column code, together with numpy.where, and forward-fill the missing values with ffill. Finally, call sort_index:

df.insert(0, 'code', np.where(mask, np.nan, 'CODE' + df.index.astype(str)))
df.code = df.code.ffill()
df = df.sort_index()
print (df)
    code col1 col2  value
0  CODE0   A1   B1    100
1  CODE1   A2   B2    200
2  CODE0   A1   B1    300
3  CODE3   A3   B1    400

edited Jan 7, 2017 at 9:50

answered Jan 7, 2017 at 9:38

jezrael


5 Comments

jezrael Over a year ago

Sure, btw, I think my code is faster, you can check it. ;)

jcarlos Over a year ago

OK, I'll try to build a comparison test here with the real data frame. Thank you.

jezrael Over a year ago

Let me know, I am really interested, because groupby is obviously slow.

jcarlos Over a year ago

Your solution is the second fastest in some tests I did here. I used my data with approx. 150,000 records, and it gets very close to piRSquared's solution when I remove the sort at the end. Is that really necessary? The concat solution was the slowest.

jezrael Over a year ago

No, sort_index in the end is not necessary, only nicer output ;)

Jarek · Accepted Answer · 2017-01-07 15:58:18Z

How to get 'temp' row number and its value to a tmp column?

The value column does not propagate because you filter it out at the beginning with df[['col1','col2']]. This is fixed by changing it to tmp = df.drop_duplicates(['col1', 'col2']).

The original row numbers are preserved as the index of tmp; if you want to copy them explicitly into a data column, just do tmp['index'] = tmp.index.
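Put together, the two fixes above might look like the sketch below. The `.copy()` call is an addition made here so that assigning a new column to `tmp` does not trigger a chained-assignment warning.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3'],
                   'col2': ['B1', 'B2', 'B1', 'B1'],
                   'value': [100, 200, 300, 400]})

# Keep every column; only col1 and col2 decide what counts as a duplicate
tmp = df.drop_duplicates(['col1', 'col2']).copy()

# Copy the preserved original row numbers into a regular column
tmp['index'] = tmp.index
print(tmp)
```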

What is the clever pythonic way to achieve the result below from df?

I do not know if it is particularly clever or not, as this is subjective, but one way of achieving that is

pd.concat([gr.assign(code='CODE{}'.format(min(gr.index))) for _, gr in df.groupby(['col1', 'col2'])])

Finally, to get the result in the form you specified, you can add .sort_index() and [['code', 'col1', 'col2', 'value']] to the above to fix the row and column order, giving:

newdf = pd.concat(
    [gr.assign(code='CODE{}'.format(min(gr.index)))
     for _, gr in df.groupby(['col1', 'col2'])]
).sort_index()[['code', 'col1', 'col2', 'value']]

The groupby and concat calls may become a performance bottleneck, which can matter if you operate on large data sets.
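To check whether this matters for your data, a rough timing harness along these lines could be used. It is only a sketch: the synthetic data frame `big`, its size, and the function names `with_concat` and `with_transform` are assumptions made for illustration, and the timings will depend on the machine and on the number of groups.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (assumed shape): 5,000 rows, up to 20 x 20 key combinations
rng = np.random.default_rng(0)
n = 5_000
big = pd.DataFrame({
    'col1': rng.integers(0, 20, n).astype(str),
    'col2': rng.integers(0, 20, n).astype(str),
    'value': rng.integers(0, 1000, n),
})

def with_concat(df):
    # One small frame per group, glued back together
    return pd.concat(
        [gr.assign(code='CODE{}'.format(min(gr.index)))
         for _, gr in df.groupby(['col1', 'col2'])]
    ).sort_index()

def with_transform(df):
    # First index value per group, broadcast with transform
    first = df.index.to_series().groupby([df.col1, df.col2]).transform('first')
    return df.assign(code=first.map('CODE{}'.format))

t_concat = timeit.timeit(lambda: with_concat(big), number=3)
t_transform = timeit.timeit(lambda: with_transform(big), number=3)
print('concat:    {:.3f} s'.format(t_concat))
print('transform: {:.3f} s'.format(t_transform))
```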

Hrvoje · Accepted Answer · 2019-11-26 06:39:24Z

If you have a DataFrame df like this:

    state       year    population
0   California  2000    33871648
1   California  2010    37253956
2   New York    2000    18976457
3   New York    2010    19378102
4   Texas       2000    20851820
5   Texas       2010    25145561

you can build an index from the state and year columns with:

df2 = df.set_index(['state','year'])

which gives you a DataFrame with a MultiIndex built from the columns state and year.

Accessing the multi-indexed data

# select the population column as a Series that carries the MultiIndex
pop = df2['population']

pop['California', 2000]
Result: 33871648

pop[:, 2010]
Result:
state
California    37253956
New York      19378102
Texas         25145561
Name: population, dtype: int64


pop.loc['California':'New York']
Result:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Name: population, dtype: int64
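A self-contained sketch of these steps, built from the population table in this answer. The variable name `pop` for the population Series is an assumption made here, and `.loc` is used for every lookup so that the label-based intent is explicit.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['California', 'California', 'New York',
              'New York', 'Texas', 'Texas'],
    'year': [2000, 2010, 2000, 2010, 2000, 2010],
    'population': [33871648, 37253956, 18976457,
                   19378102, 20851820, 25145561],
})

# Build a two-level index from the state and year columns
df2 = df.set_index(['state', 'year'])

# Series of populations carrying the (state, year) MultiIndex
pop = df2['population']

one_value = pop.loc[('California', 2000)]      # single (state, year) entry
year_2010 = pop.loc[:, 2010]                   # all states for one year
two_states = pop.loc['California':'New York']  # label slice on the first level

print(one_value)
print(year_2010)
print(two_states)
```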
